Spatial Attention Guided Residual Attention Network for Hyperspectral Image Classification

Hyperspectral image (HSI) classification has become a research hotspot. Recently, deep learning-based methods have achieved promising performance by extracting deep spectral-spatial features from HSI cubes. However, in complex scenes, owing to the diversity of land-cover types and the high dimensionality of bands, these methods are often hampered by irrelevant spatial areas and redundant bands, which results in indistinguishable features and restricted performance. In this article, a spatial attention guided residual attention network (SpaAG-RAN) is proposed for HSI classification, which contains a spatial attention module (SpaAM), a spectral attention module (SpeAM), and a spectral-spatial feature extraction module (SSFEM). Based on spectral similarity, the SpaAM is capable of capturing the relevant spatial areas, composed of the pixels of the same category as the center pixel, from an HSI cube with a novel inverted-shifted-scaled sigmoid activation function. The SpeAM aims to select the bands that are beneficial to the spectral feature representation. The SSFEM is exploited to extract the discriminating spectral-spatial features. To facilitate the processes of band selection and feature extraction, two well-designed spatial attention masks generated by the SpaAM are employed to guide the SpeAM and the SSFEM, respectively. Moreover, a spatial consistency loss function is introduced to maintain the consistency between the two spatial attention masks so that the network can distinguish the relevant features exactly. Experimental results on three HSI data sets show that the proposed SpaAG-RAN model can extract discriminating spectral-spatial features and outperforms state-of-the-art methods.


I. INTRODUCTION
Technological advancements of hyperspectral sensors and aircraft enable hyperspectral images (HSIs) to characterize the meticulous spectral features of land-cover, from the visible to the shortwave infrared wavelength ranges, with hundreds of contiguous bands. Furthermore, the improved spatial resolution of sensors provides HSIs with richer spatial structures [1]. HSI classification, which aims to assign one certain category to each pixel using its spectral and spatial features [2], has attracted increasing attention in the field of HSI analysis [3][4][5][6]. It has shown great importance in remote sensing applications, such as precision agriculture, mineralogy, military reconnaissance, etc.
During the past decades, numerous classification methods based on spectral characteristics have been developed to classify the pixels in HSI data, including support vector machines (SVMs) [7,8], k-nearest neighbors [9], random forest (RF) [10], logistic regression [11], extreme learning machine [12], etc. However, the classification results tend to be unsatisfactory when these methods encounter pixels with very similar spectral features but different labels, since the advantage of spatial information has seldom been considered. Afterwards, quite a few methods have employed spatial features as an auxiliary means, on the basis of spectral features, to enhance the representation of hyperspectral data. For instance, to refine the classification maps predicted by SVMs, Markov random fields and edge-preserving filtering have been applied to take spatial contextual information into consideration [13,14]. The morphological profile [15][16][17], which is declared to be an efficient way to explore spatial information, has been extended to the spatial feature extraction of high-dimensional hyperspectral data. In [18] and [19], the spatial information of the neighborhoods of each pixel is delivered to a sparse representation model to gain the optimal representation strategy with a set of common training samples. Furthermore, other approaches, such as Gabor filtering [20], compressive sensing [21], and discriminant analysis [22,23], have also been applied to HSI classification with the aid of spectral and spatial features.
Although the aforementioned methods have achieved acceptable performances, their classification accuracies heavily depend on the quality of the hand-crafted features, which are considered shallow features. Owing to the Hughes phenomenon [24] and the limited samples, the above-mentioned shallow models are prone to overfitting. Moreover, there is generally intense spectral variability in HSI due to capricious environmental factors, which causes large intraclass distances and serious interclass similarity. Consequently, hand-crafted features are no longer appropriate to deal with these problems, and extracting robust spectral and spatial features for HSI classification has become a widely recognized demand.
With the increasing computational ability of hardware, deep learning (DL) has made tremendous breakthroughs in computer vision (CV) tasks (e.g., image classification [25,26], scene segmentation [27], target detection [28]), natural language processing [29], etc. A variety of DL-based attempts have been performed to process hyperspectral remote sensing images, which can be divided into two general categories, i.e., spectral-based methods and spectral-spatial-based methods, according to the type of the processed information. The spectral-based methods exploit spectral information only. For example, a full-dimensional spectrum was input into an artificial neural network to discover the subtle spectral differences between different classes [30]. Instead of processing each band independently, a recurrent neural network (RNN) was utilized to take full advantage of the spectral correlation existing among particular bands [31]. In order to alleviate the computational burden caused by the redundant information between neighboring bands and increase the classification accuracy, a cascaded RNN model consisting of two RNN layers was proposed, in which one aims to reduce redundancy whereas the other is used to learn complementarity [32]. Owing to long-term dependencies, the gradients of RNNs may fade away during the training phase. Therefore, to handle this shortcoming, long short-term memory networks, as an extended version of RNNs, were proposed to gain the contextual spectral features effectively [33,34]. Although these spectral-based methods have improved the classification accuracy, there is still much room for improvement in their performance in complex scenes.
Different from the former, the spectral-spatial-based methods aim to extract the spectral and spatial features simultaneously for classification. Up to now, many studies have been carried out along this line. In [35], principal component analysis (PCA) was used to compress the spectral dimension, and then every pixel and the flattened vector of the corresponding neighborhoods were sent to multi-layer stacked autoencoders to extract spectral and spatial features, respectively. Reference [36] proposed a spatially updated deep autoencoder in which the contextual information was considered to maximize the interclass distances during feature learning. Besides, deep belief networks were also applied to capture the representative spectral features and count the statistics of neighboring pixels [37,38]. However, these models mostly transform the spatial inputs into flat vectors, which may destroy the spatial structure.
With the unique advantages of local perception and parameter sharing, convolutional neural networks (CNNs) have demonstrated the power of feature extraction and dominated the field of HSI classification. The classical two-branch CNN architectures, including 1-D CNN and 2-D CNN, were designed to extract the spectral and spatial features and then accomplish the classification via feature fusion or decision fusion strategies [39][40][41]. To further preserve the complete spectral-spatial information, an HSI cube which contains the center pixel and its neighborhoods was picked as the training sample of the network. Such an approach assumes that the label of the entire HSI cube can be represented by the label of the center pixel because of the intensive spectral similarity existing between the center pixel and the surrounding pixels in a small region. Supported by this hypothesis, 3-D CNN has been the most appropriate network to fully extract the spectral-spatial features [41]. Moreover, 3-D CNN was united with the Jeffries-Matusita distance to select effective bands for the recognition of very similar objects [42]. Aiming to address the issues of massive parameters and long training times, the 3-D convolutional layers at deep positions were substituted by 2-D convolutional layers to simplify networks and fuse features at different levels effectively [43,44]. Generally, the deeper the network, the more abstract and representative the extracted features; however, a deeper network may suffer from the vanishing gradient. To resolve this problem, the residual network (ResNet) was proposed to propagate the gradient from high layers to low layers quickly via the shortcut connections in residual blocks [45]. Zhong et al. [46] designed a deep spectral-spatial ResNet which contains serial residual blocks to alleviate the declining-accuracy phenomenon. In [47], a pyramidal bottleneck residual block was proposed to involve more feature map locations in the deeper network. Zhang et al. [48] combined the spectral-spatial fractal ResNet with data balance augmentation to improve the recall rates of the small-sample classes. In addition, to enhance the robustness of the model in unusual scenes, a dual-channel ResNet with a noise-robust loss function was proposed to fully utilize the useful information from mislabeled samples [49]. With the aforementioned great progress, ResNet has become the mainstream architecture of the spectral-spatial-based methods for HSI classification.
However, there still exists a common drawback that has yet to be resolved. HSI usually contains abundant spectral and spatial information, whereas not all of it is beneficial to the identification [50]. In other words, the spectral bands and the salient spatial regions which are beneficial to feature representation and classification are supposed to be emphasized. To this end, the attention mechanism, which is well received in neural machine translation [51] and CV tasks [52,53], has been introduced to capture the most salient bands and positions in HSIs. Among the related applications, the attention mechanism is generally embedded into the networks as an independent block to refine the feature maps by weighting bands, pixels, or channels unequally. For example, in the early stages, lightweight spectral attention modules composed of global average pooling (GAP) layers and convolutional layers were placed at the beginning of the networks to promote the influential spectral bands to play a primary role in the subsequent feature extraction [54,55]. Besides the spectral attention module, the spatial attention module was also proposed to enhance the significance of the relevant spatial regions. For example, Shamsolmoali et al. [56] employed a spatial attention module to increase the discriminating ability of the network during feature fusion. By embedding the spectral attention and spatial attention modules into the residual blocks sequentially, useful spectral-spatial features are obtained to improve the classification performance [50,[57][58][59]. However, the spectral and spatial attention modules in the above-mentioned methods are processed independently, which hinders the complementarity of spectral and spatial properties. In order to strengthen the correlation between spectral and spatial attentions, similarity matrices generated by the spectral and spatial attention branches were distributed to all locations and bands adaptively [60].
Specifically, Li et al. [61] proposed a spectral and spatial fused attention module to apply the attention masks crosswise, which aims to fully explore the correlations between spectral bands, spatial positions, neighborhoods, and the prediction results. In addition, self-attention was also adopted to explore the correlations between pixels. A self-attention (SA) model was designed to extract discriminating spectral and spatial features [62]. To enhance the effect of the center pixel, reference [63] proposed a method to compute the correlations between the center pixel and its neighborhoods. By integrating multiple SA modules, spectral and spatial transformers [64,65] were employed to model the correlations between the spectral bands and spatial locations. The transformer was also used to gain the optimized inputs for the subsequent processing [66]. However, the computational cost is enormous, as there are generally several SA modules in a transformer. In [67] and [68], the global salient spectral bands and spatial areas are extracted by the spectral and spatial non-local blocks, which are embedded into the spectral and spatial modules to refine the features, respectively.

Figure 1. The center pixel and its neighborhoods are cropped into an HSI cube to extract the spectral-spatial features. As shown in the corresponding ground-truth images, there may be more than one type of land-cover in an HSI cube. The pixels which have the same category as the center pixel are named homogeneous pixels, whereas the rest are called interfering pixels. For convenience, the unlabeled pixels, i.e., the black pixels in the ground-truth images, are regarded as interfering pixels as well. In an HSI cube, the homogeneous pixels form the relevant spatial areas, which are beneficial to the spectral-spatial feature extraction.
Although the aforementioned models have somewhat improved the classification results, there is a common deficiency in that the extracted spatial attention may not focus on the positions related to the center pixel, especially in scenes where various types of land-cover adjoin. As shown in Fig. 1, there may be more than one type of land-cover in an HSI cube, but only the pixels which belong to the same category as the center pixel (marked with a yellow dot) are worth highlighting. In this article, these pixels are named homogeneous pixels and are surrounded by green polygons, whereas the rest are named interfering pixels and are surrounded by purple polygons. As their spectral characteristics are fully or partially different from those of the center pixel, interfering pixels may mislead the attention to irrelevant spatial regions and restrict the extraction of distinguishable spectral-spatial features to some extent. On the contrary, homogeneous pixels, which express spectral properties similar to the center pixel, can lessen the impact of large intra-class variability and promote feature aggregation. Therefore, the inherent functional differences between the homogeneous pixels and the interfering pixels should be fully considered for a better classification performance.
In order to achieve the above-mentioned purpose, a spatial attention guided residual attention network (SpaAG-RAN) is proposed to highlight the relevant spatial areas and extract the discriminating spectral-spatial features for HSI classification. The proposed model is mainly composed of the spatial attention module (SpaAM), the spectral attention module (SpeAM), and the spectral-spatial feature extraction module (SSFEM). Based upon the spectral similarities between the center pixel and its neighborhoods, the SpaAM generates the spatial attention masks efficiently, which represent the spatial distribution of homogeneous pixels and interfering pixels. Similarly, the SpeAM is designed to explore the spectral attention mask, which can be interpreted as an adaptive band selector. The SSFEM is a 3-D CNN with residual blocks, which takes charge of extracting the spectral-spatial features for classification. Among the three modules, the SpaAM guides the other two modules so as to strengthen the effects of the relevant spatial areas. Specifically, one spatial attention mask is exploited to encourage the homogeneous pixels to contribute more to the selection of discriminating bands, whereas the other is used to suppress the interfering pixels in the feature maps extracted by the SSFEM. Besides, to identify the homogeneous pixels and the interfering pixels during the spectral-spatial feature extraction, a spatial consistency loss function is utilized to maintain the consistency of the spatial attention masks generated before and after the SSFEM. Experimental results on three public HSI data sets demonstrate the effectiveness of the SpaAM and the superior classification performance of the proposal.
The main contributions of this article are as follows. 1) A lightweight spectral similarity-based SpaAM is designed to capture the relevant spatial areas, which describes the spatial distribution of homogeneous pixels and interfering pixels implicitly. In this module, the spectral similarities between the center pixel and its neighborhoods are measured by the efficient Euclidean distance. A novel inverted-shifted-scaled sigmoid activation function is then in charge of converting the similarities to the proper spatial weights.
2) To improve the classification performance, a spatial consistency loss function is introduced to enable the SSFEM to extract effective features by preserving the specificity of homogeneous pixels and interfering pixels.
3) An end-to-end SpaAG-RAN model, which incorporates the SpaAM, the SpeAM, and the SSFEM, is proposed to stress the relevant spatial areas and extract the discriminating spectral-spatial features for HSI classification.
The remainder of this article is organized as follows. Section II introduces the proposed SpaAG-RAN model in detail. Section III presents the experimental results and analyses on three classical data sets. Finally, this article is concluded in Section IV.

II. METHODOLOGY
In this section, the overview of the proposed SpaAG-RAN model is first introduced. Then, the core components of network, including the SpaAM, the SpeAM, and the SSFEM, are described at length. Finally, the loss functions of the network and the optimization processes are given.

A. Framework of the Proposed Network
Suppose that the HSI data set contains labeled pixels x ∈ ℝ^b and their corresponding one-hot label vectors y ∈ ℝ^C, where b and C represent the number of bands and the number of categories, respectively, and h and w denote the height and width of the spatial dimension of the HSI. Previous researches [41,42] have demonstrated the effectiveness of spectral-spatial features for classification. Therefore, a square window of width s is used to crop the center pixels and their adjacent pixels to form the HSI cubes X ∈ ℝ^{s×s×b}. The label of the ith HSI cube X_i is assumed to be y_i, i.e., the label of the center pixel. After this preprocessing, the combination of all HSI cubes and the corresponding one-hot label vectors (X_i, y_i) forms the sample set. In this work, a certain proportion of samples are selected randomly from each land-cover category to train the network, whereas the rest are used as the validation set and the test set.

Fig. 2 shows the workflow of the proposed SpaAG-RAN model, which is mainly composed of the SpaAM, the SpeAM, and the SSFEM. The inputs are an HSI cube X and the corresponding true label y. First, X is fed into the SpaAM to gain the incipient spatial attention mask M_spa1, which is utilized to highlight the homogeneous pixels and suppress the interfering pixels in X. Next, the spectral attention mask M_spe, which indicates the influential spectral bands for distinguishing the center pixel from the interfering pixels, is extracted by the SpeAM from the calibrated HSI cube X′. With the uneven weighting operation of M_spe, the contributory bands in X′ are emphasized. Then, the processed HSI cube X″ is transported to the SSFEM to extract the discriminating spectral-spatial features F ∈ ℝ^{s×s×b′×k}, where k and b′ are the number of channels (i.e., convolutional filters) and the number of reduced bands, respectively. Before the classification, F is input to the SpaAM to acquire the terminal spatial attention mask M_spa2, which is used to suppress the interfering pixels in F and compute the spatial consistency loss with the incipient spatial attention mask M_spa1.
Finally, a fully connected (FC) layer maps the refined features to the classification space and predicts the most probable label with the softmax activation function.
During the training process, the proposed network optimizes its parameters with the cross-entropy and spatial consistency loss functions. Cross-entropy, as the universal loss function for classification problems, is adopted to minimize the error between the true label and the predicted label. Moreover, to maintain the stability of the spatial distribution of homogeneous pixels and interfering pixels, the spatial consistency loss function is installed to monitor the variation between the two spatial attention masks, i.e., M_spa1 and M_spa2. The details of the above three modules and the optimization of the loss functions are illustrated as follows.
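The cube-cropping preprocessing described at the beginning of this subsection can be sketched as follows (a minimal NumPy sketch; the array names and the mirror-padding strategy for border pixels are illustrative assumptions, not the authors' code):

```python
import numpy as np

def extract_cubes(hsi, coords, s=9):
    """Crop an s x s x b cube around each labeled pixel.

    hsi    : (h, w, b) hyperspectral image
    coords : list of (row, col) positions of labeled center pixels
    Border pixels are handled here by mirror-padding the spatial axes.
    """
    r = s // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    # after padding, original pixel (i, j) sits at (i + r, j + r)
    cubes = [padded[i:i + s, j:j + s, :] for i, j in coords]
    return np.stack(cubes)  # (n, s, s, b)

# toy example: a 10 x 10 image with 5 bands
hsi = np.random.rand(10, 10, 5)
cubes = extract_cubes(hsi, [(0, 0), (5, 5)], s=9)
print(cubes.shape)  # (2, 9, 9, 5)
```

Each cube inherits the label of its center pixel, which ends up at spatial index (s // 2, s // 2) of the cube.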

B. SpaAM
The SpaAM is designed to capture the spatial areas relevant to the center pixel. These spatial areas are promoted to take a leading role during the spectral attention generation and feature extraction.

Figure 2. Workflow of the proposed SpaAG-RAN model. The SpeAM aims to discover the discriminating bands for a better spectral feature representation. The SSFEM is exploited to extract the deep spectral-spatial features. During the back propagation procedure, the cross-entropy loss function and the spatial consistency loss function are adopted to optimize the parameters of the network. "⊙" denotes the element-wise multiplication.

Figure 3. SpaAM. It contains a 3-D convolutional layer, a similarity calculation layer, and an inverted-shifted-scaled sigmoid activation function.

In order to reach these purposes, a natural idea is to analyze the spectral similarities between the center pixel and its neighboring pixels. The higher the similarity between a pixel and the center pixel, the more possible it is that they belong to the same category; thus, this pixel should be assigned a greater weight. Considering that the spectral characteristics of land-cover in HSI data generally vary frequently under the influence of various environmental conditions (e.g., temperature and humidity), the Euclidean (L2) distance, which is not sensitive to sharp variations [69], is adopted to perform the similarity measurement.
The architecture of the SpaAM is shown in Fig. 3. Given an HSI cube X ∈ ℝ^{s×s×b}, the SpaAM aims to generate the spatial attention mask (for convenience, the incipient spatial attention mask M_spa1 is described as an example). First, a convolutional layer is employed to reduce the channels of the input to one:

F_1 = W_1 ∗ X + b_1, (1)

where F_1 ∈ ℝ^{s×s} is the single-channel feature map, and W_1 and b_1 are the convolutional kernel and bias of the convolutional layer, respectively. "∗" is the convolutional operator.
Then, the L2 distance is adopted to evaluate the similarities S ∈ ℝ^{s×s} between all pixels in F_1 and the center pixel, which is copied from F_1. The similarity at position (i, j) is calculated by

S(i, j) = ‖F_1(i, j) − F_1(i_c, j_c)‖_2, (2)

where (i_c, j_c) denotes the position of the center pixel. For a pixel, the higher the similarity it gains, the more it contributes to the classification, and vice versa. Considering that the range of the values of S is [0, +∞) and that a lower value represents a higher similarity, an inverted-shifted-scaled sigmoid activation function is proposed to allocate a proper weight to each pixel. As shown in Fig. 4, compared with the standard sigmoid activation function, the distribution of the weights is adjusted via flipping, panning, and zooming:

M_spa1(i, j) = γ / (1 + e^{S(i, j) − θ}), (3)

where γ is the scale that regulates the ceiling of the pixel weights, and θ corresponds to the similarity value at which the weight is 0.5, i.e., the threshold that divides the similarity intervals. The pixels whose similarity lies in [0, θ] are regarded as the homogeneous pixels, whereas the pixels with similarity in (θ, +∞) are seen as the interfering pixels.
After the above processes, an element-wise multiplication across the spatial dimension between the mask M_spa1 and the HSI cube X is conducted to stimulate the homogeneous pixels to contribute more to the selection of the important spectral bands:

X′ = M_spa1 ⊙ X. (4)

Similarly, following the SSFEM, the SpaAM is also used to extract the terminal spatial attention mask M_spa2 from the deep features with the identical processes. With the aid of M_spa2, the interfering pixels in the output F of the SSFEM are weakened by

F′ = M_spa2 ⊙ F. (5)
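A minimal NumPy sketch of the mask computation in equations (1)–(3) follows. For brevity, the learned convolution that reduces the bands to one channel is replaced by a simple band average, and the values of γ and θ are illustrative assumptions:

```python
import numpy as np

def spatial_attention_mask(cube, gamma=1.0, theta=0.5):
    """Inverted-shifted-scaled sigmoid spatial attention (sketch).

    cube : (s, s, b) HSI cube.
    Returns an (s, s) mask: pixels spectrally close to the center
    pixel get weights near gamma, distant pixels near 0.
    """
    # stand-in for the learned conv layer that maps b channels to one
    f = cube.mean(axis=-1)                      # (s, s) single-channel map
    c = f[f.shape[0] // 2, f.shape[1] // 2]     # value at the center pixel
    sim = np.abs(f - c)                         # L2 distance on one channel
    # flipped, shifted, scaled sigmoid: weight 0.5 * gamma at sim == theta
    return gamma / (1.0 + np.exp(sim - theta))

cube = np.random.rand(7, 7, 20)
mask = spatial_attention_mask(cube)
print(mask[3, 3])  # center pixel has distance 0, hence the largest weight
```

Because the distance of the center pixel to itself is zero, its weight is always the maximum of the mask, while pixels far from θ in similarity are driven toward zero.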

C. SpeAM
The SpeAM is designed to select the discriminating spectral bands which are beneficial to the spectral feature representation of the center pixel. Similar to the SpaAM, the spectral attention mask M_spe ∈ ℝ^b, which assigns a particular weight to each band, is generated by the SpeAM. The structure of the SpeAM, which is borrowed from the previous research [70], is shown in Fig. 5. In the input of the SpeAM, i.e., the calibrated HSI cube X′, the homogeneous pixels are highlighted and the interfering pixels are weakened. Therefore, a global max-pooling (GMP) layer with the pooling size of s × s, instead of the GAP layer of the original architecture, is first exploited to retain the most notable and germane information in each spectral band. The maximum element of the ith band of X′ is calculated by

g_i = max_{(u,v)} X′(u, v, i), (6)

where g ∈ ℝ^b is the output of the GMP layer. Then, g is delivered to an MLP to explore the collaborative and exclusive relationships between spectral bands. The MLP contains two FC layers. The first FC layer aims to reduce the information redundancy and compress the critical spectral features, whereas the second one converts the abstract compressed features to the spectral attention mask with the aid of the sigmoid activation function:

M_spe = σ(W_3 δ(W_2 g + b_2) + b_3), (7)

where W_2 ∈ ℝ^{(b/r)×b}, b_2 ∈ ℝ^{b/r}, W_3 ∈ ℝ^{b×(b/r)}, and b_3 ∈ ℝ^b are the weight parameters and the biases of the first and second FC layers under the compression ratio of r, σ is the sigmoid function, and δ is the activation function of the first FC layer.
Finally, an element-wise multiplication across the spectral dimension between the mask M_spe and the HSI cube X′ is conducted to emphasize the discriminating spectral bands for the spectral-spatial feature extraction as follows:

X″ = M_spe ⊙ X′, (8)

where X″ is the HSI cube after the band enhancement.
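The GMP-plus-MLP pipeline of the SpeAM can be sketched in NumPy as below (a squeeze-and-excitation-style sketch; the ReLU on the first FC layer and the random weights are illustrative assumptions, since the trained parameters are learned end-to-end):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention_mask(cube, w1, b1, w2, b2):
    """Spectral attention with global max pooling (sketch).

    cube : (s, s, b) spatially calibrated cube.
    w1   : (b//r, b), w2 : (b, b//r) -- two FC layers, compression ratio r.
    Returns a (b,) mask of per-band weights in (0, 1).
    """
    g = cube.max(axis=(0, 1))             # GMP over spatial dims -> (b,)
    h = np.maximum(w1 @ g + b1, 0.0)      # FC + ReLU: compressed features
    return sigmoid(w2 @ h + b2)           # FC + sigmoid: band weights

b, r = 20, 4
rng = np.random.default_rng(0)
cube = rng.random((7, 7, b))
w1, b1 = rng.standard_normal((b // r, b)), np.zeros(b // r)
w2, b2 = rng.standard_normal((b, b // r)), np.zeros(b)
mask = spectral_attention_mask(cube, w1, b1, w2, b2)
recalibrated = cube * mask                # broadcast over the spectral dim
```

The final broadcast multiplication corresponds to the element-wise weighting of equation (8): every band of the cube is rescaled by its learned weight.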

D. SSFEM
The SSFEM is built to extract the deep spectral-spatial features for classification. The architecture of the SSFEM is based on the CNN, which has been the most popular network for many CV tasks. Generally, a CNN contains convolutional layers, activation functions, and pooling layers. The convolutional layers are in charge of feature extraction. The activation functions follow to map the features into a nonlinear space. The pooling layers are mainly used to compress the feature maps.
In a classic CNN, the ith feature map of the lth layer can be formulated as follows:

F_i^l = f( Σ_j F_j^{l−1} ∗ W_{ij}^l + b_i^l ), (9)

where F_j^{l−1} is the jth output feature map of the (l−1)th layer, and W_{ij}^l and b_i^l are the jth convolutional kernel and the bias of the lth layer, respectively. "∗" is the convolutional operator. f(·) is an activation function, such as the rectified linear unit (ReLU) [71], the sigmoid function, and the hyperbolic tangent function. In this article, ReLU is employed due to its advantages in efficient gradient propagation and sparse activation.
However, when processing HSI data, the deeper the CNN is, the more the accuracy decreases. This phenomenon occurs because the classification errors from deep layers cannot be propagated back precisely, which results in the vanishing gradient. To overcome this problem, ResNet [45] adds a shortcut connection between the input volume and the output volume, which enables the network to be stacked to any depth. Thus, the vanishing gradient is alleviated and the network can be optimized easily. In view of the above-mentioned inimitable advantages of ResNet, it is adopted as the basic architecture of the SSFEM. As shown in Fig. 6, receiving the refined output X″ of the SpeAM, the SSFEM, which contains three residual blocks, is exploited to extract the spectral-spatial features F. In this module, residual blocks equipped with convolutional kernels of size 3 × 3 × 3 are connected in series to learn the deep feature representations. The numbers of convolutional kernels in the three residual blocks are {k_1, k_2, k_3}, respectively. As shown in Fig. 7, the residual block can be represented briefly as follows:

y = H(x) + W_s ∗ x, (13)

where x and y are the input and output of the residual block, and H(·) represents a series of operations (in the dotted rectangle), including convolution and activation. Note that a convolutional layer with kernel W_s is used in the shortcut connection to match the dimensions between x and y. Behind the first two residual blocks, max pooling (MP) layers are added to stress the intensive information and reduce redundancy. An average pooling (AP) layer is set after the last residual block to retain as much semantic information as possible [45]. For all pooling layers, the sizes and strides are set to 1 × 1 × 2 to preserve more spatial information.
Finally, the deep spectral-spatial features F are obtained by

F = Φ(X″), (14)

where Φ(·) represents the SSFEM. Table I displays the details of the layers in the proposal.
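The shortcut identity of equation (13) can be illustrated with a stripped-down sketch in which the 3-D convolutions are replaced by channel-mixing matrix products (an illustration of the residual connection only, not the authors' layer configuration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2, w_s):
    """y = H(x) + W_s x : a two-layer main branch plus a projection
    shortcut that matches the input/output channel counts.

    x   : (c_in, n)  -- c_in channels over n flattened positions
    w1  : (c_out, c_in), w2 : (c_out, c_out), w_s : (c_out, c_in)
    """
    h = relu(w2 @ relu(w1 @ x))   # main branch H(x): conv-like + activation
    return h + w_s @ x            # shortcut keeps gradients flowing

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 27))          # 8 channels, 3*3*3 positions
w1 = rng.standard_normal((16, 8))
w2 = rng.standard_normal((16, 16))
w_s = rng.standard_normal((16, 8))
y = residual_block(x, w1, w2, w_s)
```

Even if the main branch contributes nothing (all-zero weights), the shortcut still passes the projected input through, which is exactly why gradients reach the early layers of a deep stack.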

E. Loss Functions and Optimization
To train the proposed SpaAG-RAN model effectively, the classification loss function is exploited together with the spatial consistency loss function to optimize the parameters.
Cross-entropy, as a popular loss function for classification problems, is adopted to minimize the loss:

L_ce = −(1/N) Σ_{n=1}^{N} Σ_{c=1}^{C} y_n^c log ŷ_n^c, (15)

where y and ŷ are the true and predicted one-hot label vectors, respectively, C is the number of classes, N is the number of samples in a batch, and y_n^c denotes the scalar of the cth class of the nth sample.
The spatial attention masks, M_spa1 and M_spa2, express the correlations between the center pixel and its neighborhoods implicitly. By preserving the consistency of the two spatial attention masks during the convolution, the ability of the SSFEM to distinguish the homogeneous pixels from the interfering pixels is enhanced. To achieve this goal effectively, the mean absolute error is employed to measure the variation between the two masks. The complete spatial consistency loss function on a batch is defined as follows:

L_sc = (1/(N s²)) Σ_{n=1}^{N} Σ_{i,j} |M_spa1^n(i, j) − M_spa2^n(i, j)|. (16)

Therefore, the total loss can be formulated as

L = L_ce + λ L_sc, (17)

where λ controls the relative importance of the two functions. During the training procedure, the backpropagation and gradient descent algorithms are used to update the parameters.
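The combined objective can be sketched as follows (a NumPy sketch with toy tensors; the value of λ and the batch contents are illustrative assumptions):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot labels."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

def spatial_consistency(m1, m2):
    """Mean absolute error between the two spatial attention masks."""
    return np.mean(np.abs(m1 - m2))

def total_loss(y_true, y_pred, m1, m2, lam=0.1):
    """L = L_ce + lambda * L_sc."""
    return cross_entropy(y_true, y_pred) + lam * spatial_consistency(m1, m2)

# toy batch: 2 samples, 2 classes, 7 x 7 spatial masks
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
m1 = np.full((2, 7, 7), 0.6)   # incipient masks
m2 = np.full((2, 7, 7), 0.5)   # terminal masks
loss = total_loss(y_true, y_pred, m1, m2, lam=0.1)
```

The consistency term vanishes exactly when the masks before and after the SSFEM agree, so it only penalizes the network for drifting away from the incipient spatial distribution.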

III. EXPERIMENTS AND ANALYSES
In this section, the details of the three HSI data sets [72] collected by different imaging sensors, including Indian Pines (IP), University of Pavia (UP), and Botswana (BW), and the experimental configuration are described first. Then, the parameter settings of the network, the ablation study, and the comparison between the proposal and the state-of-the-art methods are reported and discussed. Finally, the visualization of the spatial attention masks and feature maps is presented and analyzed.

A. Data Sets and Experimental Configuration
The IP data set is gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwestern Indiana in 1992. It consists of 145 × 145 pixels and 224 spectral bands in the wavelength range from 0.4 to 2.5 μm. The spatial and spectral resolutions are 20 meters/pixel (m/p) and 10 nm, respectively. After removing 20 bands covering the region of water absorption and four zero-bands, the remaining 200 bands are used for experiments. The false-color image of the IP data set and its ground-truth (GT) are shown in Fig. 8(a) and (b). As illustrated in Table II, 15%, 5%, and 80% of the labeled pixels are selected randomly from each of the 16 land-cover categories as the training, validation, and test sets, respectively.

The UP data set is acquired by the Reflective Optics Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia, northern Italy. It consists of 610 × 340 pixels and 103 spectral bands after the removal of noisy bands, with a spatial resolution of 1.3 m/p. The false-color image of the UP data set and its ground-truth are shown in Fig. 8(c) and (d). As illustrated in Table III, 5%, 5%, and 90% of the labeled pixels are selected randomly from each of the 9 land-cover categories as the training, validation, and test sets, respectively.

The BW data set is collected by the Hyperion sensor mounted on the Earth Observing-1 (EO-1) satellite over the Okavango Delta, Botswana, on 31st May, 2001. It consists of 1476 × 256 pixels and 242 spectral bands in the wavelength range from 0.4 to 2.5 μm. The spatial and spectral resolutions are 30 m/p and 10 nm, respectively. By removing the uncalibrated and noisy bands which cover water absorption features, 145 bands remain. The false-color image of the BW data set and the corresponding ground-truth are shown in Fig. 8(e) and (f). As illustrated in Table IV, 15%, 5%, and 80% of the labeled pixels are selected randomly from each of the 14 land-cover categories as the training, validation, and test sets, respectively.
The experiments on the above three data sets are performed on a computer with an AMD Ryzen 3600 CPU at 4.07 GHz (six cores), 32-GB RAM, and an NVIDIA GeForce GTX GPU. The model is trained using the optimizer of [73] with a learning rate of 0.001 and a decay factor of 0.9. The weights and biases of all layers in the proposed model are initialized by the Xavier normal distribution [74]. The batch size is 16 and the total number of training iterations is 200.
In order to quantify the classification performance of the proposed SpaAG-RAN model, the overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) are adopted as the evaluation measures. The higher the scores, the better the classification performance.
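The three evaluation measures can be computed from a confusion matrix; a minimal sketch (the function name and label encoding are illustrative, not from the article):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    # Build the confusion matrix: rows are true labels, columns predictions.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                   # overall accuracy
    recalls = np.diag(cm) / cm.sum(axis=1)      # per-class recall
    aa = recalls.mean()                         # average accuracy
    # Chance agreement term for the kappa coefficient.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

OA weights every pixel equally, whereas AA weights every category equally, so AA is more sensitive to the small classes that dominate the later discussion of per-category recalls.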

B. Parameters Setting
The structures of DL-based models are generally complex. A model can have many hyperparameters, and finding their best combination can be treated as an optimization problem. In this section, five parameters are analyzed to optimize the proposed model, including the scale and the threshold of the inverted-shifted-scaled sigmoid activation function (analyzed jointly), the compression ratio in the MLP of the SpeAM, the number of convolutional kernels in the SSFEM, the width of the HSI cube, and the proportion of training samples.

1) SCALE AND THRESHOLD OF THE INVERTED-SHIFTED-SCALED SIGMOID ACTIVATION FUNCTION
The inverted-shifted-scaled sigmoid activation function aims to assign a rational spatial weight to each pixel. The scale determines the range of the weights, whereas the threshold can be regarded as the boundary between the homogeneous pixels and the interfering pixels. In order to ascertain the correlation between the two parameters and the classification performance, the scale and the threshold of the inverted-shifted-scaled sigmoid activation function are set to {1, 2, 5, 10, 20, 50, 100, 500} and {0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, respectively. The surface charts of the OAs on the three data sets are shown in Fig. 9. Taking the IP data set as an example, along the threshold axis, the OAs are not good enough when the threshold is less than 0.2 or larger than 0.6. Along the scale axis, the OAs are also inferior when the scale is less than 10. However, as the threshold moves away from both extremes gradually, scales larger than 10 obtain superior classification performance.
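The exact analytic form of the activation function is not reproduced in this excerpt; one form consistent with the description ("inverted" with distance, "shifted" by a threshold, "scaled" by a sharpness factor) is:

```python
import numpy as np

def iss_sigmoid(d, scale=20.0, threshold=0.3):
    # Assumed form of the inverted-shifted-scaled sigmoid: maps a spectral
    # dissimilarity d (0 = identical to the center pixel) to a spatial
    # weight in (0, 1). The threshold sets the boundary between homogeneous
    # and interfering pixels; a larger scale sharpens that boundary and
    # pushes the ceiling of the homogeneous weights toward 1.
    return 1.0 / (1.0 + np.exp(scale * (np.asarray(d, dtype=float) - threshold)))
```

At d equal to the threshold the weight is exactly 0.5; with scale 20 and threshold 0.3, pixels identical to the center receive weights above 0.99 while strongly dissimilar pixels receive weights near 0, matching the behavior analyzed in Fig. 9.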
Analyzing the above-mentioned results with reference to the inverted-shifted-scaled sigmoid activation function shown in Fig. 4(b), several conclusions can be derived. First, when the threshold is small, although most interfering pixels are shielded, a large part of the homogeneous pixels are also treated as interfering pixels, which causes inadequate feature extraction. Second, when the threshold is close to 1, the pixels involved in classification contain not only homogeneous pixels but also many interfering pixels, which also degrades the classification accuracies. Last, a larger scale pushes the ceiling of the weights of the homogeneous pixels toward 1 and clarifies the boundary between the two kinds of pixels, which are both beneficial for classification. The data points {(20, 0.3), (10, 0.4), (20, 0.5)}, marked with the yellow ellipses in Fig. 9, correspond to the maximum OAs on the IP, UP, and BW data sets, respectively.

2) COMPRESSION RATIO IN THE MLP OF THE SpeAM
The SpeAM aims to strengthen the discriminating spectral bands. The MLP plays a key role in the dimensionality reduction of the features and in the nonlinear mapping between the spectral bands. To preserve the important spectral information and reduce the redundancy, it is necessary to choose an apposite compression ratio. In this part, the effect of the compression ratio in the MLP of the SpeAM is analyzed. As shown in Fig. 10, the OAs of the proposed model on the three data sets all reach their peaks when the ratio is 2. When there is no compression (i.e. the ratio is 1), the OAs remain at a lower level. In addition, as the compression ratio increases from 2 to 10, the accuracies decline. This is because the reduced dimension leads to more spectral information being abandoned gradually. Notably, the decline of the OAs on the IP data set is the most intense among the three data sets.
One pertinent reason is that the two hundred bands of the IP data set give rise to a greater loss of spectral information under the same compression ratio compared with the other two data sets.

3) NUMBER OF CONVOLUTIONAL KERNELS IN THE SSFEM
It has been shown in [25] that there is a close connection between the number of convolutional kernels and the representational capability of the features. In order to extract sufficient spectral-spatial features efficiently, six experiments using SSFEMs with different numbers of convolutional kernels are deployed to explore their influences on the classification performance, as can be seen from Fig. 11.

4) WIDTH OF THE HSI CUBE
The width of the HSI cube also has a great effect on the classification performance. A larger width brings more spatial information into the HSI cube, but there may also be more interfering pixels. Therefore, HSI cubes with different widths {3, 5, 7, 9, 11, 13, 15} are input to the proposed SpaAG-RAN model to explore the proper widths. As shown in Fig. 12, the highest OAs are obtained on the three data sets when the widths of the HSI cubes are 11, 5, and 7, respectively. The most likely reason why the optimal widths for the UP and BW data sets are smaller than that for the IP data set is that the spatial distributions of land-cover in the UP and BW data sets are scattered and not as concentrated as in the IP data set. On the other hand, the fluctuations of the OAs on the three data sets remain below 1%, which demonstrates the robustness of the proposed model. This is because the SpaAM can recognize the homogeneous pixels and the interfering pixels precisely via the similarities between the center pixel and its neighborhoods. More importantly, the measurement of the similarities is independent of the width of the HSI cube.

5) PROPORTION OF TRAINING SAMPLES
In this part, the performance of the proposal with different proportions of training samples is investigated. For each data set, {1%, 2%, 5%, 10%, 15%, 20%, 25%} of the samples are randomly selected from each of the land-cover categories as the training set.
The experimental results are shown in Fig. 13. The OAs increase as the proportions of training samples on the three data sets increase. When the proportions of training samples of the three data sets exceed 15%, 5%, and 15%, respectively, the OAs remain at a high level.
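The compression-ratio analysis above assumes a two-layer bottleneck MLP inside the SpeAM; a minimal sketch under a squeeze-and-excitation-style design (the function name is illustrative and the weight matrices are random placeholders, not trained parameters) is:

```python
import numpy as np

def speam_mlp(band_descriptor, ratio=2, seed=0):
    # Two-layer bottleneck MLP: compress the per-band descriptor by
    # `ratio`, apply a ReLU, expand back to the full band count, and
    # squash to (0, 1) band weights with a sigmoid. A ratio of 1 means
    # no compression; larger ratios discard more spectral information.
    rng = np.random.default_rng(seed)
    n_bands = band_descriptor.shape[-1]
    hidden = max(1, n_bands // ratio)
    w1 = rng.standard_normal((n_bands, hidden)) / np.sqrt(n_bands)
    w2 = rng.standard_normal((hidden, n_bands)) / np.sqrt(hidden)
    h = np.maximum(band_descriptor @ w1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))
```

The bottleneck width `n_bands // ratio` is why the 200-band IP data set suffers the steepest accuracy drop at large ratios: more bands are forced through the same relative compression.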

C. Ablation Study
In this section, two ablation studies are carried out, involving the combination of different modules and the weight of the spatial consistency loss function. For each study, the parameter values that bring the highest OAs on the validation set are adopted.

1) COMBINATION OF DIFFERENT MODULES
The proposed SpaAG-RAN model is composed of the SpaAM, the SpeAM, and the SSFEM. The SpaAM and the SpeAM aim to generate the spatial and spectral attention masks, whereas the SSFEM takes charge of extracting the deep spectral-spatial features. In order to explore the correlations between the three modules and their impacts on the classification performance, four schemes with different combinations of the three modules are implemented on the three data sets. The four schemes are as follows: Scheme_1 uses the SSFEM only; Scheme_2 combines the SpeAM with the SSFEM; Scheme_3 combines the SpaAM with the SSFEM; and Scheme_4 is the complete SpaAG-RAN model with all three modules.

The OAs of these schemes on the test sets are presented in Table V. The numbers reported in bold-type denote the best results for each data set. It can be seen that the classification accuracies of Scheme_1 achieve an acceptable level even though they are lower than those of the other schemes. Compared with Scheme_1, Scheme_2 brings in the SpeAM to emphasize the contributory bands, which elevates the OAs on the three data sets by no less than 0.4%. More inspiringly, by installing the SpaAM on the SSFEM, the OAs of Scheme_3 on the IP, UP, and BW data sets grow to 97.81%, 98.64%, and 99.08%, respectively. Scheme_4, the complete SpaAG-RAN model, receives the best classification accuracies on the three data sets compared with the other schemes. In the proposed SpaAG-RAN model, the SpaAM can be seen as the guide of the whole network, which highlights the homogeneous pixels and restrains the interfering pixels via the spatial attention masks. Under the guidance of the SpaAM, the outputs of the SpeAM and the SSFEM are ameliorated for better classification.

2) WEIGHT OF THE SPATIAL CONSISTENCY LOSS FUNCTION
Another study is conducted to confirm the availability of the proposed spatial consistency loss function and its contribution to classification by assigning different values to its weight. As shown in Fig. 14, when the weight is set to 0, i.e. the cross-entropy loss function works alone during the training process, the proposed model receives no less than 96%, 97%, and 97% OAs on the three data sets, respectively. However, there is still room for improvement. After the spatial consistency loss function is installed, for the IP and UP data sets, the OAs reach their highest levels when the weight is set to 0.1, whereas the appropriate value for the BW data set is 0.01. The reason might be that the spatial information included in the BW data set is not as important as that in the other two data sets, as can be seen from the corresponding ground-truth maps. As the weight further increases, the OAs start to decrease. Worse still, when the weight is set to 100, the OAs are even lower than those obtained without the spatial consistency loss. Fig. 15 shows the error and accuracy curves during the training procedures on the three data sets when the weight of the spatial consistency loss is set to the optimal values and to zero. From the upper parts of Fig. 15(a)-(f), the errors of the total loss and the cross-entropy loss keep steady levels with minute fluctuations as the number of iterations approaches 200. For the three data sets, the optimal weight values minimize the spatial consistency loss to considerably low values. However, a zero weight results in the spatial consistency loss remaining at a high and gradually increasing level, which destroys the consistency of the relevant spatial areas and causes unsatisfactory classification performance.

Therefore, the key to fully exploiting the advantages of the spatial consistency loss function for classification is to regulate the balance between it and the cross-entropy loss function precisely. A larger weight may cause a larger deviation in the parameter optimization and disturb the convergence of the network. On the contrary, with a smaller weight, the spatial consistency loss plays an auxiliary role during the training procedure, which is more appropriate for HSI classification missions.

D. Comparison with Other Methods
To verify the effectiveness, the proposal is compared with two classical traditional methods, SVM with a radial basis function kernel and RF, and ten well-studied DL-based methods: 2-D CNN [41], 3-D CNN [41], spectral-spatial residual network (SSRN) [46], spectral-spatial attention network (SSAN) [62], center attention network (CAN) [63], double-branch multi-attention mechanism network (DBMA) [64], double-branch dual-attention mechanism network (DBDA) [65], 3-D cascaded spectral-spatial element attention network (CSSEAN) [59], residual spectral-spatial attention network (RSSAN) [57], and rotation equivariant feature image pyramid network (REFIPN) [56]. For each method, the network from the original article is adopted. All methods share the same data sets (as illustrated in Section III-A) with the proposed model.

1) QUANTITATIVE COMPARISONS
The quantitative evaluations, including the recalls of each category as well as the means and standard deviations of the OA, AA, and κ obtained by the different methods on the IP, UP, and BW test sets, are presented in Tables VI-VIII. First, the DL-based methods achieve better performance on the three data sets compared with the traditional methods. For instance, SVM misclassifies all the pixels of the category "Grass-pasture-mowed" (No. 7) of the IP data set and the category "Bitumen" (No. 7) of the UP data set. Since only the spectral information is used, SVM cannot fit the distributions of the two categories with limited samples. Similarly, RF, which relies on elaborate hand-crafted features to finish the classification, also yields lower performance on the three data sets. Second, compared with 2-D CNN, the classification performance of 3-D CNN is improved on the three data sets to some degree. This is because 2-D CNN only extracts the spatial features from the first principal component obtained by PCA, whereas 3-D CNN exploits the rich spectral-spatial information for classification. Third, the other six compared methods except RSSAN, which take 3-D CNN as the baseline, achieve higher classification performance on the three data sets. Specifically, in comparison with 3-D CNN, SSRN introduces the residual block, which improves the OAs on the three data sets by 0.68%, 1.20%, and 1.36%, respectively. SSAN and CAN both employ the self-attention block to capture the spectral and spatial attention, except that the latter emphasizes the center pixel during the attention acquisition. Therefore, CAN behaves better than SSAN on each data set. However, all of the OAs of SSAN on the three data sets are lower than those of SSRN, which does not utilize the attention mechanism. In Table VIII, the OA of SSAN is not even better than that of 3-D CNN. The most likely reason is that SSAN has a large number of parameters in its feature extraction layers and attention modules, which challenges the convergence of the network under the condition of few samples.
The performance of DBMA improves to a certain degree compared with SSAN and CAN. DBDA, which is stated as an improvement of DBMA, designs the spectral and spatial attention modules based on the self-attention mechanism. However, the classification performance of DBDA is better than that of DBMA on the IP data set only. CSSEAN deploys lightweight spectral and spatial element attention modules to refine the dimension-reduced spectral and spatial features, which reduces the number of parameters. Nevertheless, this may cause erratic classification accuracy when limited samples are employed. For example, on the IP data set, the numbers of samples of "Alfalfa", "Grass-pasture-mowed", and "Oats" (No. 1, 7, and 9) are 7, 5, and 3, but their recalls are 97.30%, 13.04%, and 0.00% (marked with black rectangles in Table VI), respectively. The RSSAN model, a combination of the 2-D residual block and the attention module, simplifies the network and further improves the classification results as well. Among all the compared methods, REFIPN reaches the higher OAs on the three data sets. This is attributed to its spatial attention modules, which refine the spectral-spatial features extracted by its pyramidal network. Last but not least, the proposed SpaAG-RAN model not only receives the best OA, AA, and κ but also obtains overwhelming advantages in the recalls of the categories compared with the other methods. Different from the others, the SpaAM captures the salient areas based on the spectral similarity, which weakens the interfering pixels in the neighborhoods of the center pixel. The above superior classification performances demonstrate the effectiveness of the SpaAM as well as the excellent classification ability of the proposal.

2) QUALITATIVE COMPARISONS
The ground-truth (GT) images and the visual classification maps of the different methods on the three data sets are shown in Figs. 16-18, where "Cn" represents the n-th category. Compared with the other methods, RSSAN, REFIPN, and the proposal obtain purer and smoother classification maps. Different from the UP and BW data sets, the distribution of the land-covers in the IP data set tends to be concentrated, which brings a certain challenge in distinguishing the useful pixels from the unwanted pixels. For example, the categories "Corn" and "Grass-pasture-mowed" (C4 and C7) are often not classified accurately by most compared methods. Even so, the proposed model still acquires the highest accuracies on these two categories. A similar result can be seen for the categories "Riparian" (C6) and "Reeds1" (C5, the partial results are displayed in the white square boxes of each sub-figure of Fig. 18) on the BW data set. On the whole, in comparison with the other methods, the proposed SpaAG-RAN model acquires excellent classification maps which are almost the same as the corresponding ground-truth images of the three data sets. This is because the SpaAG-RAN model is able to extract the discriminating spectral-spatial features from the relevant spatial areas.

3) TIME CONSUMPTION
The training and testing times of the proposed SpaAG-RAN model and the compared methods on the three data sets are reported in Table IX. The training time is closely linked to the complexity of the network, whereas the testing time intuitively reflects the efficiency of the algorithm in practical applications. Among the thirteen methods, the two traditional methods (i.e. SVM and RF) obviously cost less time. Among the DL-based methods, 2-D CNN achieves the fastest speed as it merely extracts the spatial features from a single-channel image. RSSAN, which adopts the 2-D CNN as the basic architecture, also reaches the second-fastest speed on the three data sets. The remaining methods all utilize the 3-D convolution for feature extraction, so their computation times are lengthy; however, their classification performances are uneven. It should be noted that SSAN costs a considerably long time to finish the training procedures, as its convolutional layers and self-attention module both have a great number of parameters. Among the methods based on 3-D CNN, CSSEAN consumes the least time to finish training and testing, which is closely associated with its pooling layers. Nevertheless, the proposed SpaAG-RAN model generates the spatial attention mask via an efficient subtraction operation and introduces few convolutional kernels with a small size to extract the spectral-spatial features, which results in a relatively fast and efficient performance on the three data sets.

Figure 19. Visualization of the spatial attention masks and feature maps of partial samples from the IP, UP, and BW data sets.
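The subtraction-based mask generation mentioned above can be sketched as follows; the normalized mean absolute spectral difference used as the dissimilarity measure and the sigmoid form are assumptions, since the exact distance measure is not reproduced in this excerpt:

```python
import numpy as np

def spatial_attention_mask(cube, scale=20.0, threshold=0.3):
    # cube: H x W x Bands patch centered on the pixel to classify.
    h, w, _ = cube.shape
    center = cube[h // 2, w // 2]                 # center pixel spectrum
    # Efficient subtraction: per-pixel mean absolute spectral difference.
    d = np.abs(cube - center).mean(axis=-1)
    d = d / (d.max() + 1e-12)                     # normalize to [0, 1]
    # Assumed inverted-shifted-scaled sigmoid: small distances -> weights near 1.
    return 1.0 / (1.0 + np.exp(scale * (d - threshold)))
```

The mask costs a single broadcasted subtraction and reduction per patch, which is why it adds little overhead compared with the 3-D convolutions it guides.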

E. Visualization of the Spatial Attention and Features
In this part, some visualization studies are conducted to intuitively illustrate the ability of the SpaAM to infer the relevant spatial areas. For each sample, its GT map, the two spatial attention masks (left and right), and the features acquired from the layer named AP (see Table I) in the SSFEM are visualized in Fig. 19. For the convenience of display, the spectral dimension of the features is compressed.
From this figure, the relevant spatial areas described in the two spatial attention masks have similar spatial structures to the corresponding GT maps. More importantly, the center pixel of each sample is contained in the relevant spatial areas and is assigned the highest spatial weight, which reveals that the production of the spatial attention takes the center pixel fully into account. With the restriction of the spatial consistency loss function, the two spatial attention masks also have extremely similar distributions. Therefore, as shown in Fig. 19, most feature maps tend to focus on the relevant spatial areas. It is the two spatial attention masks that guide the SSFEM to extract the discriminating spectral-spatial features from the relevant spatial areas only.

IV. CONCLUSION
In this article, a novel SpaAG-RAN model is proposed for HSI classification, which contains a SpaAM, a SpeAM, and an SSFEM. The SpaAM aims to highlight the relevant spatial areas. The SpeAM aims to emphasize the spectral bands which are beneficial to the feature representation. The SSFEM is designed to extract the spectral-spatial features. In the lightweight spectral similarity-based SpaAM, a novel inverted-shifted-scaled sigmoid activation function is designed to convert each spectral similarity to an appropriate spatial weight. With the guidance of the SpaAM, the SpeAM and the SSFEM can work better. At the same time, to consolidate the capability of the SSFEM to discern the subtle differences between the homogeneous pixels and the interfering pixels, the spatial consistency loss function is exploited to preserve the stability of the spatial attention masks. The experimental results on three public data sets demonstrate the validity of the SpaAM and the outstanding classification performance of the proposal.
However, the scale and the threshold parameters of the inverted-shifted-scaled sigmoid activation function are set manually and vary in different scenarios. One future direction of this work is to realize the adaptive selection of these two parameters. Moreover, a more effective spectral similarity measurement to acquire a more precise spatial attention mask is demanded as well.