Expansion Spectral–Spatial Attention Network for Hyperspectral Image Classification

Deep learning is increasingly used for the classification of hyperspectral images (HSI), thanks to its ability to fully exploit the rich characteristics of this type of imagery. However, most classification models currently proposed for HSI data are based on standard convolutional neural networks, which tend to learn local rather than global information, so it is difficult to achieve ideal accuracy when training samples are insufficient, as is common in real applications. In this article, we propose a novel expansion spectral–spatial attention network (ESSAN) for HSI data classification in cases of insufficient training samples. First, a dual-branch network based on expansion convolution is employed as the model backbone to extract spectral and spatial information. All feature maps produced in the dual-branch process are superimposed, following the ResNet concept, to combine deep and shallow features. By stacking expansion convolutional layers, the network enlarges the receptive field and gathers more global contextual information. Second, the model includes a coordinate attention block, which directs the network to weight features according to their significance and to suppress irrelevant ones. Finally, the method was tested on four datasets, Matiwan Village, Pavia Center, Pavia University, and Shenzhen University, using 1%, 1%, 5%, and 0.2% of the samples for training, respectively. The overall accuracies were 97.96%, 99.12%, 98.73%, and 99.36%, respectively. These preliminary results demonstrate that the proposed ESSAN achieves higher efficacy and accuracy in HSI data classification than other state-of-the-art methods.

applications have excelled in a variety of fields, including crop monitoring, the estimation of crop leaf area index inversion, and the prediction of soil organic carbon [1], [2], [3], [4].
The purpose of remote sensing image classification is to categorize each type of feature presented in an image. There have been numerous studies on HSI classification owing to the high spatial and spectral resolutions and the wealth of information that HSI provides. Initially, traditional machine learning algorithms were widely applied to HSI classification. For example, support vector machines were used to classify reduced-dimensional data [5], and Zhang et al. [6] proposed a spatial-spectral joint classification method based on the random forest. The large number of HSI bands and the redundancy between adjacent bands increase noise and uncertainty, which may limit classification accuracy when training samples are limited [7], [8]. Therefore, feature extraction and dimensionality reduction techniques have been developed. Traditional methods used in the early days include principal component analysis (PCA) for linear dimensionality reduction [9], [10], linear discriminant analysis [11], kernel PCA for nonlinear dimensionality reduction [12], isometric feature mapping [13], and extended morphological profiles [14], [15]. More advanced band selection and dimensionality reduction techniques have also been proposed. Zhang et al. [16] proposed a new spectral-spatial SuperPCA method to reduce dimensionality and extract effective low-dimensional features of HSI. He et al. [17] proposed a dual global-local attention network band selection method for high-dimensional hyperspectral data reduction.
As the volume of tasks and data continues to grow, inappropriate manual selection of training features may cause misclassification, so that the accuracy does not meet expectations. Therefore, a new branch of machine learning, deep learning, has emerged. Convolutional neural networks (CNNs) have been fairly successful in categorizing ground objects from HSI. The 1-D CNN is straightforward and requires little hardware configuration [18]. Wei et al. [19] parsed raw hyperspectral data using a 1-D CNN to extract and classify hierarchical spectral features. However, the 1-D CNN only uses the 1-D vector pixel information, whereas 2-D CNNs [20] can fully utilize the rich spectral values or the spatial information in HSI. In comparison with 1-D and 2-D CNNs, the 3-D CNN [21], [22], [23] combines spectral and spatial information to improve classification results, but its complicated network topology increases hardware configuration requirements.
Pi et al. [24] suggested a shallow GDIF-3D-CNN classification model using 3-D convolution to classify pure and mixed pixel sets by tuning the parameters. Lee and Kwon [25] suggested extracting features by combining spatial-spectral contextual information with a Context-Deep CNN (CDCNN). Theoretically, a deeper network can gather more informative features and produce better results; nevertheless, the deeper the network gets, the greater the chance that vanishing and exploding gradients will occur, which worsens the outcomes. To address these issues, He et al. [26] proposed the ResNet residual structure. The spectral-spatial residual network (SSRN) was proposed by Zhong et al. [27] using the ResNet residual block as the primary structure. Inspired by ResNet, Wang et al. [28] created the fast dense spectral-spatial convolution (FDSSC), in which the network feeds all of the feature maps output by the previous module into the next module via dense connections to achieve accurate classification; however, the large number of parameters increases the training time. Combining convolutional layers of different dimensions in the same model can better capture the spatial and spectral information of multidimensional data, thereby improving the accuracy and generalization ability of the model. The HybridSN model proposed by Roy et al. [29] consists of a 3-D convolutional block that extracts spectral information, followed by a 2-D convolutional block that extracts spatial information. Compared with using only a 3-D CNN, HybridSN reduces the complexity of the model. Tinega et al. [30] suggested a deep 3-D/2-D genome graph-based network (HybridGBN-SR) that is suitable for small sample data and does not exhibit overfitting. Yang et al. [31] proposed a synergistic CNN that combines a hybrid convolutional module with a data interaction module.
All of the methods described above are implemented on a single branch and cannot extract information from several channels and spaces at the same time. Therefore, some researchers have proposed multibranch networks to extract the desired features separately. For instance, to classify images of coastal wetlands, Xie et al. [32] created a dual-branch multilayer global spectral-spatial attention network. They employed the extended random walker approach to maximize the classification probability and build the final map. To improve the capability of extracting global information from small HSI sample data, Feng et al. [33] proposed a three-branch mixed spatial-spectral features cascade fusion network, which uses two 3-D residual modules and one 2-D separable residual block to extract features that are then fused to form a cascade fusion model.
Although traditional convolution can produce accurate classifications, the local operation of a convolution kernel with a fixed shape and size cannot capture features over a large range, and a large number of parameters significantly increases the computing workload. To overcome this issue, Shi et al. [34] presented the feedback expansion convolution net (FECNet), which introduces holes into the regular convolution kernel to increase the receptive field (RF) and extract more contextual information. Zhao et al. [35] reduced the computing costs with a hybrid depth separable residual network based on depth separable convolution.
The vast spectral and spatial features offered by HSI increase information redundancy. The attention mechanism [36], [37], [38], [39] enables the network to concentrate on more crucial features and enhances model performance. Ma et al. [40] created the dual-branch multiattention (DBMA) network by incorporating spectral and spatial attention mechanisms in two branches of the model. Li et al. [41] suggested the dual-branch dual-attention (DBDA) network, which flexibly employs an adaptive attention mechanism. Gong et al. [43] proposed the spectral and spatial attention network model to apply the attention mechanism to HSI-based change detection. In addition, the transformer [44] has also been successfully applied to HSI classification tasks. The transformer uses a self-attention mechanism to learn global features, which can better capture the global relationships and contextual information in the image. By mining the characteristics of the HSI spectrum from the viewpoint of a transformer, Hong et al. [42] proposed a backbone network called SpectralFormer that learns spectral sequence information; however, SpectralFormer does not yield high classification accuracy with small HSI sample sizes. Based on this backbone network, Sun et al. [45] proposed the spectral-spatial feature tokenization transformer to capture spectral-spatial features and high-level semantic features, greatly improving computational efficiency. Liu et al. [46] proposed a hyperspectral image transformer iN transformer method for drawing coastal wetland classification maps from satellite HSIs, which achieved good classification results.
Despite the good results achieved by existing deep learning algorithms, there are still numerous issues with the classification of HSI features, such as insufficient training samples [47], [48] and a high number of parameters [49], which makes training slow. This article proposes a novel expansion spectral-spatial attention network (ESSAN) to address these issues and to enhance the extraction of global spatial and spectral information from HSI using a dual-branch CNN structure with an attention mechanism.
The main contributions of this article can be summarized as follows. 1) We propose a dual-branch structure based on expansion convolution to extract features. This structure reduces the number of parameters and broadens the RF while preserving the spatial-spectral information produced by each layer. 2) The model incorporates the coordinate attention block (CAB), which gives more weight to relevant information and suppresses unfavorable characteristics, thereby improving accuracy and robustness. The experiments demonstrate that the CAB raises the network's overall classification accuracy. 3) ESSAN combines the expansion convolution block and the attention block from shallow to deep, which can effectively extract feature information from HSI in the case of insufficient samples. Moreover, ESSAN has fewer parameters. We conducted comprehensive experiments on three public HSI datasets and a self-created SZU dataset, and the results demonstrate that ESSAN outperforms state-of-the-art methods in terms of classification accuracy and training efficiency. ESSAN mainly consists of three parts: the dual-branch network block, the coordinate attention block, and the expansion convolution basic structure. A cube of 13×13×band is fed into the spatial branch, and a 13×13 patch obtained after PCA dimensionality reduction is fed into the spectral branch.
The rest of this article is organized as follows. Section II provides a detailed description of the proposed ESSAN framework. Section III presents the datasets used in this study and compares the experimental results of the proposed method with those of eight other models. Finally, Section IV concludes this article.

II. METHODOLOGY
In this section, we provide a thorough introduction to the ESSAN network framework and all of its elements, including expansion convolution's fundamental structure and design, the dual-branch network module, and the attention mechanism. We also show the advantages of this approach for HSI categorization.

A. ESSAN Framework
The ESSAN framework includes three components (see Fig. 1): the dual-branch network block, the coordinate attention block, and the expansion convolution basic structure.
The RF is the region of the input image onto which a pixel in the output feature map is mapped. For the same convolution kernel size, expansion convolution has a larger RF than standard convolution. For the same RF, expansion convolution has fewer parameters and a faster calculation speed than standard convolution. We use two branches to extract the spectral and spatial information of HSI data effectively, and then combine them to derive joint features. First, to extract spatial information, we create a small cube centered on each pixel of the original image in the three dimensions of height, width, and channel (e.g., a 13×13×band cube, where band is the number of bands), and then pass these small cubes to the spatial branch. Similarly, for the HSI after PCA dimensionality reduction, we take a patch centered on each pixel in the height and width dimensions, namely a 13×13 patch, and then pass it to the spectral branch to extract spectral information. Second, using a CAB, we combine the spectral and spatial features so that more significant information receives a higher weight and less important information receives a lower weight. Finally, we use a fully connected layer to aggregate all features and build classification maps based on the number of land cover categories, which prevents feature locations from affecting the classification results.
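The cube extraction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `extract_patch` is hypothetical, the image is a plain nested list indexed as `image[row][col][band]`, and border pixels are zero-padded, which the text does not specify.

```python
def extract_patch(image, row, col, size=13, pad_value=0.0):
    """Return a size x size x bands cube centered on pixel (row, col).

    `image` is a nested list indexed as image[row][col][band].
    Positions outside the image are filled with `pad_value`; this border
    handling is an assumption, not something stated in the paper.
    """
    half = size // 2
    height, width = len(image), len(image[0])
    bands = len(image[0][0])
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                patch_row.append(list(image[r][c]))
            else:
                patch_row.append([pad_value] * bands)
        patch.append(patch_row)
    return patch


# Toy image: 20 x 20 pixels with 4 bands; each pixel stores its coordinates.
toy = [[[float(r), float(c), 0.0, 1.0] for c in range(20)] for r in range(20)]
cube = extract_patch(toy, row=10, col=10)
print(len(cube), len(cube[0]), len(cube[0][0]))  # 13 13 4
```

The same routine applied to the PCA-reduced image would yield the 13×13 patches for the spectral branch, with the band axis shortened to the number of retained components.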
In this article, we divided the sample data into three sets: a training set, a validation set, and a test set. Samples from the training set are used to train the model and adjust the parameters. The validation set is used to monitor the network performance after each epoch with updated parameters and to determine the optimal combination of hyperparameters. The test set is used to evaluate the performance of the model after training is complete and to determine the model's generalization ability. Cross-entropy loss is used as the loss function to update the model's parameters. The multiclass loss function can be expressed as follows [32]:

$$\mathrm{Loss}(X, \mathrm{target}) = -\log\left(\frac{\exp\left(X_{\mathrm{target}}\right)}{\sum_{i=1}^{C}\exp\left(X_i\right)}\right) \tag{1}$$

where $C$ represents the total number of categories, $X_i$ is the predicted score for the $i$th category, and $X_{\mathrm{target}}$ is the predicted score for the actual label.

B. Two-Dimensional and 3-D Expansion Convolution
1) Size of the RF:
Fig. 2 shows the analysis process of determining the RF size, which is calculated as follows:

$$r_{l+1} = r_l + \left(k_{l+1} - 1\right)\prod_{i=1}^{l} S_i \tag{2}$$

where $r_l$ is the RF size of the $l$th layer, $k_{l+1}$ is the convolution kernel size of layer $l+1$, and $S_i$ is the stride of the $i$th layer. The RF needs to be enlarged on a reasonable basis to ensure that the network uses global information rather than only local information. For instance, if the size of the input image is 13 × 13 and the RF of a pixel in the last layer of the feature map is greater than 13, then all of the information in the original image was covered by the features retrieved for the final classification of that pixel.
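As a worked illustration of the RF recursion, the sketch below accumulates the RF layer by layer: each layer adds (kernel size − 1) times the product of the strides of all earlier layers, starting from an RF of 1 for the input pixel. The function name `receptive_field` and the example layer configurations are assumptions chosen for illustration, not settings taken from the paper.

```python
def receptive_field(kernel_sizes, strides):
    """RF of the last layer for a stack of standard convolutions.

    Implements r_{l+1} = r_l + (k_{l+1} - 1) * prod(S_1..S_l), with r_0 = 1.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # product of the strides of all preceding layers
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


# Three standard 3x3 layers with stride 1 (hypothetical configuration).
rf = receptive_field([3, 3, 3], [1, 1, 1])
print(rf)        # 7
print(rf >= 13)  # False: a 13 x 13 input is not fully covered
```

With three standard 3×3 layers the RF is 7, which is smaller than a 13×13 input patch; this is the shortfall that expansion convolution is meant to close.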

2) Basic Structure of Expansion Convolution:
Traditional standard convolution typically relies on stacked convolutional layers and pooling layers to enlarge the RF, but because of the limitations of this approach, many convolution variants have been developed. Among them, expansion convolution captures multiscale context information by altering the field of view through the expansion coefficient without altering the size of the feature map. The operation of 2-D expansion convolution is illustrated with an example in Fig. 3. Compared with standard convolution, expansion convolution has an additional hyperparameter called the expansion rate, which determines the spacing between the convolution kernel's points. The three images in Fig. 3 each have a convolution kernel size of 3×3, and the expansion rates from left to right are 1, 2, and 3, respectively. The red box represents the size of the equivalent convolution kernel; the blue squares represent the positions of the convolution kernel's points; and the white squares within the red box represent the holes, which are typically all filled with 0. When the expansion rate is 1, the RF is the same as that of a standard 3×3 convolution kernel. When the expansion rate is 2, the RF equals that of a standard 5×5 convolution kernel. When the expansion rate is 3, the RF equals that of a standard 7×7 convolution kernel. The RF differs when different expansion rates are selected, meaning that multiscale information is collected. To attain the necessary RF size in a practical application, an appropriate expansion rate should be chosen according to the size of the input image.
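The zero-filled equivalent kernel described above can be built explicitly. The following sketch is illustrative only (the helper name `dilate_kernel` is hypothetical): it inserts rate − 1 zeros between neighboring kernel points, which reproduces the 3×3, 5×5, and 7×7 equivalent sizes quoted for rates 1, 2, and 3.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zero-filled holes between the points of a square kernel."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)          # equivalent kernel size
    out = [[0] * size for _ in range(size)]  # holes are filled with 0
    for i in range(k):
        for j in range(k):
            out[i * rate][j * rate] = kernel[i][j]
    return out


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
for rate in (1, 2, 3):
    equivalent = dilate_kernel(ones, rate)
    nonzero = sum(v != 0 for row in equivalent for v in row)
    # The footprint grows with the rate, the number of weights stays at 9.
    print(rate, len(equivalent), nonzero)
```

In practice a framework applies the dilation implicitly rather than materializing the zeros, so the cost stays that of a 3×3 kernel.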
Equation (2) gives the RF size of ordinary convolution. By replacing the ordinary convolution kernel size in (2) with the equivalent convolution kernel size, the RF size of expansion convolution can be derived. The equivalent convolution kernel size is calculated as follows:

$$K = k + (k - 1)(\mathrm{rate} - 1) \tag{3}$$

where $K$ is the size of the equivalent convolution kernel, $k$ is the size of the initial convolution kernel, and rate is the expansion rate. The RF size of layer $l+1$, $R_{l+1}$, is then

$$R_{l+1} = R_l + \left(K_{l+1} - 1\right)\prod_{i=1}^{l} S_i \tag{4}$$

The RF grows as the expansion rate rises, whereas the number of kernel weights stays fixed. For the same RF, expansion convolution therefore has fewer parameters than standard convolution, and the saving increases as the expansion rate rises. The 3-D expansion convolution works on the same principles as the 2-D one but with a 3-D spatial relationship instead. The 3-D convolution is subject to the same rules for the RF and the number of parameters as the 2-D convolution.
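To make the RF and parameter comparison concrete, the sketch below evaluates the equivalent kernel size and the resulting RF for a stack of 3×3 expansion convolutions. It assumes stride 1 throughout and uses the rates 1, 2, and 5 that the paper adopts later; the function names and the single-channel 2-D parameter count are illustrative assumptions.

```python
def equivalent_kernel(k, rate):
    """Equivalent kernel size K = k + (k - 1) * (rate - 1)."""
    return k + (k - 1) * (rate - 1)


def dilated_receptive_field(k, rates, strides=None):
    """RF of a stack of dilated convolutions with kernel size k."""
    strides = strides or [1] * len(rates)
    rf, jump = 1, 1
    for rate, s in zip(rates, strides):
        rf += (equivalent_kernel(k, rate) - 1) * jump
        jump *= s
    return rf


rates = [1, 2, 5]
print([equivalent_kernel(3, r) for r in rates])  # [3, 5, 11]
print(dilated_receptive_field(3, rates))         # 17

# Weights per 2-D filter and channel: a dilated 3x3 kernel versus a standard
# kernel with the same footprint as the rate-5 layer.
dilated_weights = 3 * 3
standard_weights = equivalent_kernel(3, 5) ** 2
print(dilated_weights, standard_weights)         # 9 121
```

Under these assumptions the three-layer stack reaches an RF of 17, which exceeds the 13×13 input patch, while each layer still holds only nine weights per filter and channel.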

C. Dual-Branch Network Block
The model employs a dual-branch CNN, as seen in the ESSAN framework flowchart (see Fig. 1). The composition and layout of spatial and spectral branches are thoroughly explained in this section.
1) Expansion Rate of the Dual-Branch Expansion Convolutional Layer: Expansion convolution is frequently used because it can produce a larger RF. However, inappropriate expansion rate settings can cause the gridding effect [46] when multiple layers of expansion convolution are stacked. Three expansion convolutions are used sequentially in Figs. 4(a) and 5(a). The convolution kernel size is 3 in both cases, but the expansion rate choices differ. As demonstrated in Fig. 4(b)-(d), three expansion convolutions with the same expansion rate only employ a portion of the input within their corresponding RF, losing some features and the correlation between pieces of information. However, when three expansion convolutions with different expansion rates are stacked [see Fig. 5(b)-(d)], there is no gap between the pixel values, since all of the pixel information in the equivalent RF is employed. In both cases, the convolution kernel size and the number of parameters are the same and only the expansion rate varies, with Fig. 5 providing the preferred solution. As a result, it is crucial to design reasonable expansion coefficients; the distribution of the expansion rates should be zig-zagged.
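The gridding effect can be reproduced with a small one-dimensional experiment. The sketch below lists which input offsets can influence a single output pixel after stacking kernel-size-3 expansion convolutions. The uniform rate of 2 is a hypothetical stand-in for the equal-rate case of Fig. 4, and the rates 1, 2, and 5 are the ones the paper selects later; the function names are illustrative.

```python
def covered_offsets(rates, k=3):
    """1-D input offsets that reach one output pixel after stacking the layers."""
    half = k // 2
    offsets = {0}
    for rate in rates:
        # Each layer looks at taps spaced `rate` apart around every reached position.
        offsets = {o + t * rate for o in offsets for t in range(-half, half + 1)}
    return sorted(offsets)


def has_gaps(rates, k=3):
    """True if some input position inside the RF is never used (gridding)."""
    offs = covered_offsets(rates, k)
    return len(offs) < offs[-1] - offs[0] + 1


print(covered_offsets([2, 2, 2]))  # [-6, -4, -2, 0, 2, 4, 6]
print(has_gaps([2, 2, 2]))         # True
print(has_gaps([1, 2, 5]))         # False
```

With equal rates only every second input position is used, whereas the mixed rates cover all 17 positions of the RF without holes.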
A straightforward method known as hybrid dilated convolution (HDC) [50] was suggested for choosing the expansion rates of a group of consecutive convolutional layers, here three neighboring layers. The expansion rates should follow the formula:

$$L_i = \max\left[L_{i+1} - 2d_i,\; L_{i+1} - 2\left(L_{i+1} - d_i\right),\; d_i\right] \tag{5}$$

with $L_n = d_n$ for the last of the $n$ layers. The goal is to make $L_2 \le K$, where $K$ is the convolution kernel size and $d_i$ is the expansion rate of the $i$th layer, $i \in \{1, 2, 3\}$. When the convolution kernel size is $K = 3$ and the expansion rates of the three convolutional layers are $d = [1, 2, 5]$, $L_3 = 5$ and $L_2 = \max[1, -1, 2] = 2 < K$, which meets the condition. Therefore, the expansion rates of the 3-D and 2-D convolutions in both the spatial branch and the spectral branch in these experiments are 1, 2, and 5.
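The rate check can be automated. The sketch below assumes the recursion from the hybrid dilated convolution design, in which the value for the last layer equals its own rate and each earlier value is the maximum of three candidate gaps; the function names are hypothetical, and the failing example [1, 2, 9] is included only for contrast.

```python
def hdc_gaps(rates):
    """Return [L_1, ..., L_n] for a list of expansion rates.

    Assumed recursion: L_n = d_n and
    L_i = max(L_{i+1} - 2*d_i, L_{i+1} - 2*(L_{i+1} - d_i), d_i).
    """
    gaps = [0] * len(rates)
    gaps[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        nxt, d = gaps[i + 1], rates[i]
        gaps[i] = max(nxt - 2 * d, nxt - 2 * (nxt - d), d)
    return gaps


def satisfies_hdc(rates, kernel_size):
    """Check the design goal L_2 <= K (requires at least two layers)."""
    return hdc_gaps(rates)[1] <= kernel_size


print(hdc_gaps([1, 2, 5]))          # [1, 2, 5]
print(satisfies_hdc([1, 2, 5], 3))  # True
print(satisfies_hdc([1, 2, 9], 3))  # False
```

For the rates 1, 2, and 5 the second value is 2, which does not exceed the kernel size of 3, matching the check carried out in the text.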
2) Dual-Branch Network Block: It is challenging to train complicated CNNs for HSI classification with small sample sizes, and stacking with many 3-D convolution operations will slow the network. Therefore, we extract features from the spatial and spectral branches using a dual-branch CNN structure (see Fig. 6).
The spatial branch contains three expansion convolutional layers for extracting multiscale features. Different expansion rates retrieve features at various scales, and the third layer's expansion rate of 5 allows it to extract information at the global level. As shown in Fig. 6, each 3-D expansion convolutional layer is denoted by its number of output feature maps, its convolution kernel size, and its expansion rate, separated by hyphens. For instance, the 3-D convolution denoted by 32-3×3×3-1 has 32 feature maps, a convolution kernel size of 3, and an expansion rate of 1. While the spatial branch employs a 3×3 convolution kernel to extract semantic position information, the spectral branch utilizes a 1×1 convolution kernel to filter unnecessary information and concentrate more on the discriminant channels. After each expansion convolutional layer, a batch normalization (BN) layer is introduced to increase the speed of training and convergence, reduce overfitting during training, and enhance network stability. A rectified linear unit (ReLU) is added between each expansion convolutional layer and the BN layer to increase the nonlinearity of the interaction between the layers. We assemble all the feature maps produced by the expansion convolutional layers in the spirit of ResNet [26], and we denote the spatial branch feature map by $K_{\mathrm{spa}}$ and the spectral branch feature map by $K_{\mathrm{spe}}$. In this notation, $b \in \{\mathrm{spe}, \mathrm{spa}\}$, $l \in \{1, 2, 3\}$, $F_b(x)$ is the input feature of the spatial or spectral branch, $K_b^l(x)$ is the feature map obtained by the $l$th convolutional layer of branch $b$, and $W_b^l$ is the corresponding convolution kernel. In the spatial branch, $W_b^l \in \mathbb{R}^{3\times 3\times 3}$; in the spectral branch, $W_b^l \in \mathbb{R}^{1\times 1}$. $r_b^l(x)$ is the added bias, and $\sigma$ represents the ReLU activation function.
To aggregate the spectral information $K_{\mathrm{spe}}$ and the spatial information $K_{\mathrm{spa}}$ efficiently, as seen in Fig. 7, the two features are concatenated to create the spectral-spatial global joint feature $F_G$. The aggregate feature $F_G \in \mathbb{R}^{B\times H\times W}$ includes extensive spectral and spatial context information. To highlight crucial joint information, reduce unnecessary information, and eliminate noise, the aggregate feature is then fed into the attention module to produce a weight map.

D. Coordinate Attention Block
The attention model in the CNN assigns a different weight to each component of the input, selects crucial information by adjusting the size of the weights, and makes the model pay more attention to these crucial details at each pixel, thus improving the training accuracy and effect. A CAB [51] is added to the network to focus the pixels' attention on the various categories. The complete flowchart of the CAB framework is shown in Fig. 8.
After the spatial-spectral aggregate feature $F_G$ enters the CAB, global average pooling is first used to acquire the height feature $M^{\mathrm{ave}}_{\mathrm{Height}}$ and the width feature $M^{\mathrm{ave}}_{\mathrm{Width}}$. After the features in the two directions are concatenated, a 2-D convolutional layer, a BN layer, and an activation function called h_swish are applied in sequence to create a long-range dependence that combines information in the X and Y directions. In this step, $\theta \in \{\mathrm{Height}, \mathrm{Width}\}$, $\oplus$ represents the feature concatenation, and $\delta$ denotes the h_swish activation function. Global information is now present in each dimension of the feature map $M_{XY}$. A split function is then used to separate the feature map $M_{XY}$, and the values are squashed to between 0 and 1 using the Sigmoid activation function, which produces two sets of weight maps along the height and width directions. Finally, the dot product operation is applied to these two sets of weight maps to obtain the weighted maps in the X and Y directions.
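The data flow of the CAB can be traced with a deliberately simplified sketch. It is not the block used in the paper: it handles a single channel, and the learned 2-D convolution and BN layer are replaced by the identity (an assumption made only to keep the example self-contained), leaving the directional pooling, concatenation, h_swish, split, Sigmoid gates, and reweighting. Applying both gates multiplicatively to the input is likewise an assumption about how the two weight maps are used.

```python
import math


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(fmap):
    """Reweight a single-channel H x W map with height and width gates.

    Simplification (assumption): the shared convolution and BN are omitted.
    """
    height, width = len(fmap), len(fmap[0])
    # Directional average pooling: one value per row and one per column.
    pooled_h = [sum(row) / width for row in fmap]
    pooled_w = [sum(fmap[i][j] for i in range(height)) / height
                for j in range(width)]
    # Concatenate both directions and apply the shared nonlinearity.
    joint = [h_swish(v) for v in pooled_h + pooled_w]
    # Split back into the two directions and squash to (0, 1).
    gate_h = [sigmoid(v) for v in joint[:height]]
    gate_w = [sigmoid(v) for v in joint[height:]]
    return [[fmap[i][j] * gate_h[i] * gate_w[j] for j in range(width)]
            for i in range(height)]


demo = [[1.0, 2.0, 3.0],
        [0.0, 0.0, 0.0]]
weighted = coordinate_attention(demo)
print(len(weighted), len(weighted[0]))  # 2 3
```

Because both gates lie strictly between 0 and 1, every output value is a damped copy of the input, with rows and columns that carry larger average responses damped less.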

III. EXPERIMENTS AND ANALYSIS
A large number of experiments were conducted on four datasets to evaluate the performance of ESSAN and the model's classification ability when training samples are insufficient.

A. Experimental Datasets
Four hyperspectral datasets were used in this experiment: three widely used public hyperspectral datasets, namely Matiwan Village (MV) [52] in Xiong'an New Area, Pavia Center (PC), and Pavia University (PU), and a new land cover classification dataset that we created, named the Shenzhen University (SZU) HSI dataset. Figs. 9-12 display each dataset's true color image, ground-truth classification map, category colors, and number of samples. Fig. 9 indicates the number of image pixels in each category.
2) PC: The ROSIS sensor collected the data for the PC dataset, which covers the center of Pavia in northern Italy. The sensor has 115 bands; however, only 102 of them are present in the PC dataset after excluding 13 noise bands. With a spatial resolution of 1.3 m, the image's spatial size is 1096×715 pixels. The image contains nine land cover classes.

3) PU: The PU dataset was likewise obtained from the ROSIS sensor; 103 bands were kept after 12 noise bands were eliminated. The dimension of the image area is 610 × 340 pixels. Nine different urban feature categories, each with more than 1000 labeled pixels, are represented on the ground-truth map.
4) SZU: SZU is a university in Shenzhen, Guangdong province of China. An unmanned aircraft platform equipped with a Specim FX10 hyperspectral sensor was used to collect data on SZU. This sensor captured 112 bands with a total wavelength range of 0.4-1 μm. Radiometric calibration, geometric correction, and atmospheric correction were applied to the original data during the preprocessing stage. The images have a spatial resolution of 0.1 m and a spatial size of 8757×3373 pixels. The ground-truth data include a total of ten categories.

B. Experimental Setting 1) Sample Settings:
For each of the four datasets, only a small subset of pixels was chosen as training samples to test the effectiveness of the proposed network model for classification with insufficient samples. For MV, PU, PC, and SZU, the training sample proportions were set to 1%, 5%, 1%, and 0.2%, respectively, with SZU having the largest spatial extent and the smallest training proportion. Accordingly, the validation and test sample proportions were established at 3%, 10%, 5%, and 0.5%, respectively.
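One way such a split could be drawn is sketched below. Several choices are assumptions rather than details given in the text: sampling is stratified per class, at least one sample per class is kept in the training and validation sets, all remaining labeled pixels go to the test set, and the function name `stratified_split` is hypothetical.

```python
import random


def stratified_split(labels, train_frac, val_frac, seed=0):
    """Split sample indices into train/validation/test sets class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, val, test = [], [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Keep at least one sample per class in train and validation (assumption).
        n_train = max(1, round(len(idxs) * train_frac))
        n_val = max(1, round(len(idxs) * val_frac))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # everything left over (assumption)
    return train, val, test


# Toy labels: 1000 pixels of class 0 and 500 pixels of class 1.
toy_labels = [0] * 1000 + [1] * 500
train_idx, val_idx, test_idx = stratified_split(toy_labels, 0.01, 0.03)
print(len(train_idx), len(val_idx), len(test_idx))  # 15 45 1440
```

Fixing the seed keeps the split reproducible across the repeated runs that are averaged in the experiments.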
2) Parameter Settings: PyTorch was used to implement all of the networks in this experiment. The input size was set at 13×13 based on prior knowledge, the number of training epochs was 100, and Adam was chosen as the optimizer. We tested the five values of 0.001, 0.005, 0.0001, 0.00005, and 0.00005 for the learning rate before settling on 0.0001 as the experiment's learning rate after several iterations of testing. All model results are the average of five experimental runs, and the standard deviation over the five runs is included in the results for each category. All experiments were run on a workstation configured with an Intel(R) Xeon(R) Gold 5218R CPU, an NVIDIA GeForce RTX 3080 GPU, 128 GB of RAM, and the Windows 10 operating system.
3) Evaluation Factor: The quality of the classification results was assessed by comparing the overall accuracy (OA), average accuracy (AA), and Kappa coefficient. OA is a good indicator of classification accuracy when the number of samples in each category is balanced. AA is the mean, over all categories, of the proportion of samples in each category that are correctly classified. The Kappa coefficient measures the degree of agreement between the recognition results and the actual labels across categories.
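The three metrics can be computed from a confusion matrix as sketched below. The definitions used are the standard ones (OA as the fraction of correctly classified samples, AA as the mean per-class recall, and Cohen's Kappa from observed and chance agreement); the function name and the toy matrices are illustrative, and every class is assumed to have at least one sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_tot = [sum(confusion[i]) for i in range(n)]
    col_tot = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    oa = sum(confusion[i][i] for i in range(n)) / total
    aa = sum(confusion[i][i] / row_tot[i] for i in range(n)) / n
    # Chance agreement expected from the row and column marginals.
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Balanced toy example: OA and AA coincide.
balanced = [[90, 10],
            [5, 95]]
print(classification_metrics(balanced))

# Imbalanced toy example: the rare class is half wrong, which AA exposes.
imbalanced = [[98, 2],
              [5, 5]]
oa, aa, kappa = classification_metrics(imbalanced)
print(round(oa, 4), round(aa, 4))  # 0.9364 0.74
```

The second example shows why OA alone can be misleading on imbalanced data: it stays high while AA drops, because the small class is classified poorly.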

C. Comparison Methods
We compared the proposed method with eight commonly used networks, namely 3D-CNN, HybridSN, SSRN, CD-CNN, DBMA, DBDA, FDSSC, and FECNet, to validate its effectiveness on the datasets.
1) 3D-CNN [23]: To extract spatial characteristics, the 3D-CNN framework utilizes 3-D convolutional layers. Three convolutional layers and three maximum pooling layers make up the model. Each convolutional layer is also followed by a BN layer and a ReLU.
2) HybridSN [29]: To extract joint spatial and spectral features, the HybridSN hybrid network links 3-D convolution and 2-D convolution in series.
3) SSRN [27]: It introduces the skip connections of residual networks, which allow deeper neural networks to be used to enhance classification performance.
4) CD-CNN [25]: A deep context network that uses local spatial-spectral properties between nearby vectors of the central pixel to investigate contextual information.
5) DBMA [40]: It has two branches to extract spectral and spatial features, and adds an attention mechanism to each of the two branches to make sure that more discriminative features can be extracted.
6) DBDA [41]: Although it was developed from DBMA, DBDA introduces different attention mechanisms in the spectral and spatial branches.
7) FDSSC [28]: It uses fast dense spectral-spatial joint convolution, and its densely connected structure fully learns each feature to produce an extremely accurate classification.
8) FECNet [34]: FECNet increases the RF and extracts more contextual information through expansion convolution, and the model includes a feedback mechanism that combines deep and shallow features.
The classification accuracy and feature classification maps of ESSAN and the eight other models on the MV dataset are shown in Table I and Fig. 13, respectively. The accuracy achieved by ESSAN was higher than those from other approaches (see Table I). ESSAN and FDSSC both produce extremely precise classification results, but a training round of ESSAN takes only around 32 s, whereas a training round of FDSSC takes about 14 min.
This is because, during training, the FDSSC model inputs all the feature maps produced by the previous module into the subsequent module, leading to a massive number of parameters and, thus, a slow training process. By utilizing an expansion convolutional residual block, ESSAN and FECNet are able to gather global information while also lowering the number of parameters, speeding up training, and increasing computational efficiency without compromising accuracy. Owing to the insufficient number of training samples, HybridSN, SSRN, and CD-CNN did not effectively extract "soybean (label 16)." DBMA and ESSAN, which introduce the attention mechanism, gave more weight to important information in the case of insufficient samples and achieved higher classification accuracy. As can be observed from Fig. 13, numerous features were incorrectly identified by SSRN, DBDA, and other models because the categories in the MV dataset are all vegetation-based and have similar spectral properties. The proposed ESSAN, however, produces a better feature classification map.
The evaluation metrics and feature classification maps for the PC dataset are displayed in Table II and Fig. 14, respectively. The number of feature categories in PC is nine, which is half as many as in MV, and the classification results on the PC dataset are better than those on the MV dataset. Table II demonstrates that the proposed method outperforms the comparison methods in terms of OA (99.12%), AA (97.61%), and Kappa coefficient (98.72%). FECNet, FDSSC, DBMA, and CD-CNN ranked next in OA after the proposed method, although their AA was 3.5%, 5.49%, 6.5%, and 8.43% lower than that of ESSAN, respectively. Fig. 14 shows that the other compared methods cannot separate "Bitumen (label 5)" well and have all misclassified "Bitumen (label 5)" as "self-locking bricks (label 4)." The 3D-CNN, HybridSN, SSRN, and CD-CNN show a lot of salt-and-pepper noise on the classification maps. However, the classification maps of the dual-branch DBMA and DBDA are noticeably superior to those of the other evaluated approaches. The proposed ESSAN accurately captures the spatial and spectral properties of the data using a dual-branch structure, as shown in Table II and Fig. 14, and the obtained classification results are the closest to the ground-truth labels.
The classification evaluation metrics and result plots for the PU dataset are shown in Table III and Fig. 15, respectively. ESSAN still achieved the highest OA, AA, and Kappa values. Compared with other methods, the proposed ESSAN performed significantly better in terms of OA. In this dataset, CD-CNN performed well (OA = 95.36%) and had the highest accuracy in classifying the "Trees (label 4)" class, at 98.28%. Fig. 15 displays the classification results for each method on the PU dataset. Zooming in on the classification maps, we can see that those of SSRN and HybridSN contain more noise, which may be due to large spectral variation within the same feature class causing severe feature mixing. The classification maps produced by FECNet, FDSSC, DBMA, DBDA, and CD-CNN are clearly superior to those of the models mentioned above. In comparison, the ESSAN model generates a smoother feature classification map by fully utilizing global information and the attention mechanism.
The results of the nine classification methods are displayed in Table IV and Fig. 16. There are ten categories in the SZU dataset and, because of the large variations between them and their more regular features, all classification methods achieved good OA. However, the ESSAN described in this article produces the highest OA, AA, and Kappa coefficient on SZU. In the "water (label 3)" category, DBMA, DBDA, and FDSSC all achieved 100% accuracy, and the final OA of these three models was only 0.46%, 0.33%, and 0.33% lower than that of the proposed ESSAN. However, the average training time per round for FDSSC is 581 s, more than 38 times ESSAN's 15 s. FDSSC training therefore becomes extremely slow when the image is large and there are many sample pixels, whereas ESSAN, with its dual-branch structure and expansion convolutional residual blocks, can ensure that all discriminative features are extracted in complex scenes while also speeding up training. Furthermore, as illustrated in Fig. 16, all approaches appear to mistakenly categorize "trees (label 9)" as "grassland (label 4)," with the proposed method making the fewest errors.
In conclusion, the proposed ESSAN produced the best OA, AA, and Kappa coefficients on all datasets, as well as the classification maps closest to the ground truth, demonstrating the potential of ESSAN in HSI data classification.
Table V presents the findings, with OA serving as the criterion for accuracy assessment. Concatenating the spatial branch with the spectral branch in the dual-branch network produces the single-branch network used in the experiment. As can be seen, the dual-branch network achieves superior classification accuracy compared with the single-branch network because it can completely and effectively extract spatial and spectral information from the original data.
Compared with SEC, DEC increases the OA on the MV, PC, PU, and SZU datasets by 2.28%, 2.51%, 5.47%, and 0.7%, respectively. As can be seen in Fig. 17, adding the dual-branch structure yields the largest gain in OA on PU, although the number of training samples for PU is the smallest. This suggests that the dual-branch block is better suited for improving model accuracy on datasets with limited training samples.
Expansion convolution improves classification accuracy compared with standard convolution: DEC and SEC, which employ it, have higher overall accuracy than DSC and SSC. In this experiment, the patch size fed into the network is 13. When three expansion convolutions with different expansion rates are applied in succession, the RF completely covers the patch. Thus, the expansion convolution can gather global information across the patch, which increases network accuracy.
Two attention modules, the convolutional block attention module (CBAM) [36] and the squeeze-and-excitation (SE) [37] attention module, were compared to demonstrate the efficacy of the CAB used in this article. Table VI presents the results from four datasets using various attention modules. As can be observed, CAB achieved the highest OA on the MV, PC, PU, and SZU datasets, which were 97.96%, 99.12%, 98.73%, and 99.36%, respectively. The CBAM assigns weights in both the channel and the spatial dimensions. The parameters become redundant when many weights are applied to the features, which does not help to increase the overall model accuracy. SE models the interdependence between channels, which increases accuracy at the cost of a modest amount of computation. In the MV dataset, SE earned the greatest AA, and in the PC dataset, the highest OA and Kappa coefficients. The lightweight CAB does not burden network computation because it simply assigns weights along the spatial dimensions. Table VI shows that, in the PU and SZU datasets, the CAB approach had the largest OA and the best classification accuracy.
After the different attention models were applied to the four datasets, their outputs were extracted as semantic features and reduced to two dimensions with t-distributed stochastic neighbor embedding (t-SNE) [53] for visualization, as shown in Fig. 18. It can be seen that adding CBAM to the model does not allow ground objects to be distinguished correctly.
Because different objects share the same spectrum in the MV dataset, the mixing of different types of features is very serious after adding CBAM. SE's visualization is far superior to CBAM's, but some feature mixing remains in the MV dataset. In the PC dataset, SE and CAB obtained the same OA, and the "Tree" and "Asphalt" features are both misclassified. The phenomenon of the same objects with a different spectrum was reduced after CAB was incorporated into the model, and features of the same type were grouped into the same area. In particular, the boundaries of each type of feature in the MV dataset are very clear, and the advantages are obvious compared with the other two attention models. This also verifies the effectiveness of adding CABs to ESSAN.
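The receptive-field argument from the expansion convolution experiment (a patch size of 13 covered by three stacked expansion convolutions) can be checked with a little arithmetic. The sketch below assumes stride-1 layers with 3 × 3 kernels and expansion rates of 1, 2, and 3; the kernel size and rates are assumptions chosen for illustration, since this section does not state them.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field along one spatial axis of stacked stride-1 dilated convolutions."""
    rf = 1
    for rate in dilation_rates:
        # A dilated kernel spans rate * (kernel_size - 1) + 1 inputs,
        # so each layer widens the receptive field by rate * (kernel_size - 1).
        rf += rate * (kernel_size - 1)
    return rf


# Assumed configuration: 3 x 3 kernels with expansion rates 1, 2, 3.
print(receptive_field(3, [1, 2, 3]))  # 13
# Three standard 3 x 3 convolutions (rate 1 everywhere) for comparison.
print(receptive_field(3, [1, 1, 1]))  # 7
```

Under these assumptions the stacked expansion convolutions see the whole 13 × 13 patch with the same number of weights as three standard 3 × 3 layers, whose receptive field stops at 7.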
3) Performance of ESSAN on Insufficient Samples: Experiments were carried out with extremely insufficient samples to confirm the efficacy and applicability of the proposed ESSAN model in HSI dataset classification with insufficient training samples. For this experiment, the MV and SZU datasets were chosen, and the training samples for each dataset were reduced in the same proportion, to 0.25%, 0.5%, and 0.75% of the total samples for MV and 0.05%, 0.1%, and 0.15% for SZU. The outcomes of this experiment are shown in Table VII and Fig. 19. As can be observed from Fig. 19, the advantages of ESSAN over other approaches grow as the sample size decreases. When the sample size of the two datasets was reduced to one-quarter of the original, the OA of the other comparison methods showed a significant downward trend. In the MV dataset, CD-CNN and DBDA had the worst results, and HybridSN had the worst results in SZU. Overall, only two models, FDSSC and ESSAN, show a slight reduction in OA, and these two models outperform the others when dealing with extremely small training sets. However, FDSSC's training takes significantly longer than ESSAN's. Table VII shows that ESSAN has the highest OA in both datasets. As a result, the experimental results in this section further verify the effectiveness of the proposed ESSAN in insufficient sample situations.
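To make the reduced training ratios concrete, the following sketch draws a per-class (stratified) random training set at a given fraction. The article does not describe its sampling procedure in this section, so the function `stratified_split`, the rule of keeping at least one sample per class, and the synthetic label counts are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices into train/test sets, class by class.

    labels: list of class ids, one per labelled pixel.
    train_fraction: for example 0.0025 for 0.25%.
    Each class keeps at least one training sample (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = max(1, math.ceil(len(indices) * train_fraction))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Synthetic labels: three classes with 4000, 1000, and 200 pixels.
labels = [0] * 4000 + [1] * 1000 + [2] * 200
for fraction in (0.01, 0.0075, 0.005, 0.0025):
    train, test = stratified_split(labels, fraction)
    print(f"{fraction:.2%}: {len(train)} training pixels, {len(test)} test pixels")
```

At the smallest ratios the rarest class contributes only a single training pixel in this sketch, which illustrates why accuracy on small classes is the first to degrade when samples are scarce.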

4) Analyzing the Effect of the Dropout Rate:
The effect of different dropout rates on accuracy was investigated in this experiment using four datasets with dropout rates ranging from 0.1 to 0.9. The OA, AA, and Kappa coefficients for the four datasets at various dropout rates are presented in Table VIII. In the proposed method, PU performs best when the dropout rate is 0.7; PC and MV give the best classification results at 0.4; and SZU performs best at 0.5. Table VIII shows that, as the dropout rate increases, the classification accuracy of ESSAN approximately follows a rising and then falling trend. Accordingly, the dropout rates for MV, PC, PU, and SZU are set to 0.4, 0.4, 0.7, and 0.5, respectively.
5) Impact of the Number of Training Samples: Fig. 20 shows the classification accuracy results for the four datasets in the ESSAN model with various training ratios. Here, the horizontal axis represents the proportion of training samples, while the vertical axis indicates the overall accuracy. Evidently, as the number of training samples rises, classification performance on all four datasets improves. This further demonstrates the efficiency of the proposed method by showing that it can obtain higher classification performance with adequate training samples.

IV. CONCLUSION
In this article, we proposed ESSAN, an HSI classification model based on the expansion convolution. First, to make the features easier to distinguish from one another, we created a dual-channel structure of joint spatial-spectral features, with both the spatial and spectral branches being blocks of residual structures based on the expansion convolution; the RF was increased by stacking expansion convolutional layers to gain richer global feature information. In addition, the attention mechanism was added to the network to acquire a weight map that improves feature extraction, which greatly increased classification accuracy and improved the model's training efficiency. The comparison of ESSAN with eight other popular deep learning HSI classification algorithms reveals that ESSAN obtains the best classification results with few training samples and has greater classification efficiency on four different datasets. On the MV dataset, in particular, ESSAN's advantages are more pronounced. This is because all of the classes in the MV dataset are vegetation, there is little variation in the features, and the phenomenon of different objects with the same spectrum exists. As a consequence, the recognition results of the other comparison methods on MV are relatively poor, whereas ESSAN obtains higher OA, AA, and Kappa coefficients. This demonstrates that ESSAN has certain advantages in identifying similar objects in HSI data.
Yiming Chen received the Ph.D. degree in cartography and geographical information system from Beijing Normal University, Beijing, China, in 2018.
He is currently an Assistant Professor Fellow with the Chinese Academy of Surveying and Mapping, Beijing, China. His research interests include air-ground LiDAR data forest resources stereomonitoring survey.
Chengchao Hou received the B.Sc. degree in surveying and mapping engineering from Shijiazhuang Tiedao University, Shaoxing, China, in 2020. He is currently working toward the master's degree in photogrammetry and remote sensing with the Chinese Academy of Surveying and Mapping, Beijing, China.
His research focuses on tree species classification based on deep learning from hyperspectral images.