Heterogeneous Spectral-Spatial Network With 3D Attention and MLP for Hyperspectral Image Classification Using Limited Training Samples

Methods based on convolutional neural networks (CNNs) have become a vital branch of hyperspectral image (HSI) classification. In recent years, the 3-D CNN (3-DCNN) has become dominant due to its excellent capability of extracting features. However, the high dimensionality and limited training samples of HSI usually restrict the improvement of its classification accuracy. Moreover, conventional 3-DCNNs have large numbers of parameters, which increases computational complexity and running time. Therefore, a new model named HSSAM is proposed to solve the above problems. First, a 3-D residual-dense asymmetric convolution (3-D-RDAC) is designed to reuse features while reducing parameters. Subsequently, 3-D-RDAC is combined with multiscale convolution to construct a 3-D multiscale RDAC (3-D-MRDAC) that avoids blind spots and unrecognized regions in receptive fields. Then, the 3-D attention mechanism SimAM is applied to 3-D-MRDAC to constitute the heterogeneous spectral-spatial attention convolutional neural (HSSAN) block, which extracts the spectral-spatial features of HSI adequately. Ultimately, an MLP acts as the output layer of the model to better deal with the nonlinear features of HSI. Experiments in this article are carried out on four famous hyperspectral datasets: Indian Pines, Pavia University, WHU-Hi-LongKou, and WHU-Hi-HanChuan. Results show that HSSAM achieves better classification accuracy with limited training samples than several existing models, with overall accuracy reaching 96.84%, 98.85%, 98.01%, and 97.18% on the four datasets, respectively.

I. INTRODUCTION

Unlike the grayscale image and the RGB image, the hyperspectral image (HSI) is 3-D data with hundreds of consecutive spectral bands, which provides extremely rich spectral information and detailed spatial texture information [1]. HSI has been widely used in many fields, such as plant disease detection [2], mineral exploration [3], ecosystem measurement [4], urban management [5], land change monitoring [6], and other fields [7], [8], [9]. To improve the applications of HSI, multifarious works related to HSI have been developed, including HSI classification (HSIC) [10], HSI band selection [11], and HSI anomaly detection [12]. Among them, HSIC plays a vital role in HSI processing tasks, with the purpose of distinguishing the land category of different pixels [13].
Initially, the similarity of spectral information was measured by statistical algorithms, and hyperspectral pixels were distinguished on this basis [14], [15]. However, the spectral characteristics of the same land objects may differ, and similar spectral characteristics may correspond to different land objects; therefore, the accuracy of such statistical methods is limited. In earlier times, HSIC mainly adopted traditional machine learning models. The number of spectral bands in HSI is generally more than 100, while the actual categories of land objects are generally fewer than 30, so information redundancy exists across bands. Machine learning algorithms usually first adopt principal component analysis (PCA) [16], independent component analysis (ICA) [17], linear discriminant analysis [18], and so on to reduce spectral redundancy, and then use classifiers, such as the decision tree [19], support vector machine [20], and K-nearest neighbor [21], to classify the preprocessed features. Compared with previous statistical methods, machine learning methods have achieved considerable performance improvement. However, they rely on artificially designed features for classification, which makes it hard to extract more complex information from hyperspectral data [22].
Owing to the rapid development of deep learning [23], deep learning approaches have become more capable than machine learning techniques of extracting abstract information through multilayer neural networks, which can enhance classification performance [24]. At present, deep learning methods have become one of the mainstream means of HSIC. Chen et al. [25] first introduced the theory of deep learning into the field of HSIC, extracting spatial-spectral features from HSI by using stacked autoencoders and achieving good classification results. Subsequently, various deep learning networks have been used to precisely classify HSI, such as deep belief networks [26], convolutional neural networks (CNNs) [27], graph convolutional networks [28], the multilayer perceptron (MLP) [29], and other models [30], [31], [32]. Among them, CNNs occupy the dominant position in HSIC.
The CNN methods used for the classification of HSI include the 1-D CNN (1-DCNN) [33], 2-D CNN (2-DCNN) [34], and 3-D CNN (3-DCNN) [35]. Originally, the mainstream 2-DCNN methods for HSIC could only extract spatial features. Inspired by the ability of the 1-DCNN to extract HSI spectral features, the 3-DCNN, which can extract joint spectral-spatial features, gradually became the mainstream method of HSIC. The 3-DCNN can effectively enhance the classification effect, but at a considerable computational cost. Thus, asymmetric convolution came into being [36], which adopts two convolution kernels of 1×3 and 3×1 to replace the original 3×3 convolution kernel; precision is maintained while the number of parameters is greatly reduced. At the same time, as the number of network layers increases, gradient vanishing and gradient explosion appear, and the information of shallow features cannot be reused in the CNN. The residual neural network (ResNet) [37] with skip connections and the dense convolutional network [38] with dense connections were designed to solve the above problems. Subsequently, the two ideas were introduced into the 3-DCNN. A spectral-spatial residual method for HSIC was designed by Zhong et al. [39], which can relieve declining accuracy. Paoletti et al. [40] increased the flow of information by adding dense connections between different layers of the model to achieve superior HSIC performance. However, most residual and dense models do not effectively reuse the spectral-spatial features of previous layers and usually possess large numbers of parameters. Meng et al. [41] proposed a residual-dense asymmetric convolutional network (RDACN), which combines the advantages of residual connections and dense connections to make use of previous-layer information. Nevertheless, this model only used the 2-DCNN, which ignores the joint information of spectra and space in HSI.
In recent years, the rise of attention mechanisms has brought a new research direction to deep learning and has been extensively used in HSIC [42]. Wang et al. [43] proposed a new ResNet model that adaptively learns the weights for different spectral bands and different neighboring pixels of HSI with the squeeze-and-excitation [44] module. Li et al. [45] proposed the double-branch dual-attention model with the dual attention network [46], which concentrates on the spectral-spatial association information of HSI. However, these HSIC attention models extract spatial information and spectral information separately, ignoring the relationship between them. Yang et al. [47] proposed SimAM for directly deducing the 3-D attention weights of features. Meanwhile, the comeback of the MLP has provided researchers with new ideas for HSIC. Since the reflection of sunlight by ground objects and the transmission of incident and reflected light through the air are nonlinear processes, hyperspectral data has strong nonlinear characteristics. The MLP has hidden layers, which handle nonlinear features better than ordinary linear layers. In addition, the translation invariance and local connectivity of the CNN can affect the effectiveness of HSIC [48]. The MLP has fewer constraints, which can eliminate the above defects and pay attention to spatial structure and information.
Hence, based on RDACN, a 3-D residual-dense asymmetric convolution (3-D-RDAC) is presented in this article; meanwhile, to effectively avoid blind spots and unrecognized regions in the receptive field, a 3-D multiscale RDAC (3-D-MRDAC) is designed by combining 3-D-RDAC with multiscale kernels. At the same time, this article combines the 3-D attention mechanism SimAM with the designed 3-D-MRDAC to form a heterogeneous spectral-spatial attention convolutional neural (HSSAN) block, which automatically deduces the required 3-D attention weights for features without adding parameters. Finally, an MLP is selected as the output layer of the proposed model, constituting the HSSAN network, which improves the accuracy of HSIC.
The contributions of the proposed HSSAM are mainly fourfold.
1) A 3-D residual-dense asymmetric convolution is proposed. The 3-DCNN extracts spectral-spatial association information, the residual-dense connection reuses features and adequately learns the feature information of the convolutional layers, and asymmetric convolution is adopted to reduce the parameters of the model.

2) A 3-D multiscale residual-dense asymmetric convolution is designed. Blind spots and unrecognizable areas of the receptive field can be avoided. At the same time, when facing difficult features, information can be fully extracted with fewer layers, which promotes the efficiency of the model.

3) The HSSAN block with SimAM attention is designed. The block can directly deduce the required 3-D attention weights of features in the model, thereby screening out the information most vital for HSIC from a mass of information.

4) A heterogeneous spectral-spatial network is proposed for HSIC, in which an MLP is used as the output layer. The MLP is better at capturing long-range dependencies and dealing with nonlinear features of HSI.

The rest of this article is organized as follows. Section II elaborates on the related theory. Section III describes the proposed HSSAM method in detail. Section IV introduces a variety of experiments on four famous HSI datasets. Section V presents a discussion of the ablation analysis and a comparison of performance for different training samples and different window sizes. Finally, Section VI concludes this article.

II. RELATED WORKS

A. 3-D Convolution
As shown in Fig. 1, 3-D convolution refers to the process of dot multiplication and summation at corresponding positions between a 3-D convolution kernel and 3-D data. Let the input data be $X_{\mathrm{input}} \in \mathbb{R}^{H\times W\times B}$, where the height is $H$, the width is $W$, and the number of bands is $B$. The value $v_{li}^{xyz}$ of the $i$th feature map at spatial position $(x, y, z)$ in the $l$th layer [49] is

$$v_{li}^{xyz} = \phi\Big(b_{li} + \sum_{n}\sum_{h}\sum_{w}\sum_{b}\delta_{lin}^{hwb}\, v_{(l-1)n}^{(x+h)(y+w)(z+b)}\Big) \tag{1}$$

where $N_l$ is the number of convolutional kernels of the $l$th layer, $\delta_{lin}^{hwb}$ is the value of the $n$th 3-D convolution kernel connected to the $i$th feature map at location $(h, w, b)$, $b_{li}$ is the bias of the $l$th layer connected to the $i$th 3-D feature map, and $\phi(\cdot)$ refers to the activation function.
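As a concrete illustration of (1), the following sketch applies a single 3-D convolution layer to an HSI patch. PyTorch is used here only as an example framework, and the tensor sizes (one 15×15 window with 30 bands, 48 filters) are illustrative assumptions, not prescriptions from the article.

```python
import torch
import torch.nn as nn

# Illustrative sketch: one 3-D convolution over an HSI patch.
# Layout is (batch, channels, bands, height, width); sizes are assumptions.
x = torch.randn(1, 1, 30, 15, 15)                   # one 15x15 window, 30 bands
conv = nn.Conv3d(in_channels=1, out_channels=48,
                 kernel_size=(3, 3, 3), padding=1)  # padding preserves the sizes
v = conv(x)                                         # phi(.) would be applied next
print(v.shape)                                      # torch.Size([1, 48, 30, 15, 15])
```

Each of the 48 output maps corresponds to one index $i$ in (1); the nested sums over $(h, w, b)$ are what `Conv3d` evaluates at every position $(x, y, z)$.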

B. SimAM Attention
Because 3-D attention weights are superior to traditional 1-D and 2-D attention weights, Yang et al. [47] proposed SimAM for directly deducing the 3-D attention weights of features. To better achieve attention, the importance of each neuron should be evaluated. The simplest way to find important neurons is to measure the linear separability between neurons, whose energy function is

$$e_t(\omega_t, b_t, y, o_i) = (y_t - \hat{t})^2 + \frac{1}{N-1}\sum_{i=1}^{N-1}(y_o - \hat{o}_i)^2 + \lambda\,\omega_t^2 \tag{2}$$

where $t$ is the target neuron and the $o_i$ are the other neurons in a single channel of the input features $X_{\mathrm{input}} \in \mathbb{R}^{H\times W\times B}$, $\hat{t} = \omega_t t + b_t$ and $\hat{o}_i = \omega_t o_i + b_t$ are linear transforms of $t$ and $o_i$, $i$ indexes the spatial dimension, $N = H \times W$ is the number of neurons in that channel, and $\lambda$ is a regularization coefficient. All values in (2) are scalars. When $\hat{t}$ equals $y_t$ and all the other $\hat{o}_i$ equal $y_o$, (2) attains its minimal value; here $y_t$ and $y_o$ are two different values, for which binary labels (i.e., 1 and -1) can be adopted for simplicity. The closed-form solutions for $\omega_t$ and $b_t$ are

$$\omega_t = -\frac{2(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \qquad b_t = -\frac{1}{2}(t+\mu_t)\,\omega_t \tag{3}$$

where $\mu_t$ and $\sigma_t^2$ are the mean and variance calculated over all neurons of that channel barring the neuron $t$. $Q$ denotes the count of energy functions of each channel. The minimal energy is

$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \tag{4}$$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the channel mean and variance. The lower the energy $e_t^{*}$, the more the neuron $t$ differs from its surrounding neurons, and the higher its priority for visual processing.
The entire refinement phase of SimAM is

$$\tilde{X} = \mathrm{sigmoid}\Big(\frac{1}{E}\Big) \odot X \tag{5}$$

where $X$ is the input feature, $E$ groups all $e_t^{*}$ across the channel and spatial dimensions, and the sigmoid limits overly large values in $E$.
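The refinement phase described above can be written compactly. The sketch below follows the pseudocode released by the SimAM authors; extending it from 4-D image tensors to 5-D spectral-spatial tensors by flattening all positions of each channel is an assumption made here for illustration.

```python
import torch

def simam(x, lam=1e-4):
    """Parameter-free SimAM refinement: X_tilde = sigmoid(1/E) * X, per channel."""
    b, c = x.shape[:2]
    flat = x.reshape(b, c, -1)             # flatten band/spatial positions
    n = flat.shape[2] - 1                  # neurons per channel minus the target
    mu = flat.mean(dim=2, keepdim=True)
    d = (flat - mu) ** 2                   # (t - mu)^2 for every neuron
    v = d.sum(dim=2, keepdim=True) / n     # channel variance
    e_inv = d / (4 * (v + lam)) + 0.5      # proportional to 1 / e_t*
    return (flat * torch.sigmoid(e_inv)).reshape_as(x)

y = simam(torch.randn(2, 48, 10, 15, 15))
print(y.shape)                             # torch.Size([2, 48, 10, 15, 15])
```

Note that the module introduces no learnable parameters, which is precisely why the HSSAN block can add 3-D attention "without adding parameters."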

C. Multilayer Perception (MLP)
The MLP is constituted by a series of fully connected layers, including input, output, and hidden layers [50]. Data is propagated forward from the input layer to the output layer. If the input of the $l$th layer is set as $i_{l-1}$, the output $o_l$ is derived from

$$o_l = \phi(W_l\, i_{l-1} + b_l) \tag{6}$$

where $W_l$ is the weight of layer $l$, $b_l$ is the bias of layer $l$, and $\phi(\cdot)$ represents the nonlinear activation function [29]. The MLP with the Gaussian error linear unit (GELU) [51] used in this article is shown in Fig. 2. The mathematical expression of GELU is

$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\Big[1 + \mathrm{erf}\Big(\frac{x}{\sqrt{2}}\Big)\Big] \tag{7}$$

where $x$ is the input, $\Phi(x)$ is the cumulative distribution function of the Gaussian $\mathcal{N}(\mu = 0, \sigma = 1)$, and $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt$.
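An MLP of this kind can be sketched as below; the hidden width (64) and the number of classes (9) are illustrative assumptions, not the sizes used in the article.

```python
import torch
import torch.nn as nn

# Minimal sketch of an MLP head with GELU; widths are assumptions.
mlp = nn.Sequential(
    nn.Linear(9, 64),    # input layer -> hidden layer
    nn.GELU(),           # GELU(x) = x * Phi(x), as in (7)
    nn.Linear(64, 9),    # hidden layer -> class logits
)
logits = mlp(torch.randn(4, 9))
print(logits.shape)      # torch.Size([4, 9])
```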

III. PROPOSED METHOD
The HSSAM method proposed in this article is a heterogeneous spectral-spatial attention network combined with SimAM and an MLP, whose detailed structure is illustrated in Fig. 3. Multiscale kernels are combined with the proposed 3-D-RDAC to form the 3-D-MRDAC, SimAM is inserted into the 3-D-MRDAC to constitute the HSSAN block, and the MLP acts as the output layer of the model.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

From Fig. 3, the original input data of HSSAM can be represented as $X_{\mathrm{orig}} \in \mathbb{R}^{H\times W\times B}$, of which the height is $H$, the width is $W$, and the number of bands is $B$. Because of the large number of HSI bands and the strong correlation among different bands, PCA is adopted to extract the dominant components of the features and remove spectral redundancy. The data after PCA is reshaped to $X_{PCA} \in \mathbb{R}^{H\times W\times C}$, where $C$ is the number of spectral bands after dimensionality reduction.
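The PCA step can be sketched as follows. scikit-learn is an assumed library choice, and the cube sizes (145×145×200 reduced to 30 components, roughly IP-like) are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch: flatten the H x W x B cube to pixels, keep C components, reshape back.
H, W, B, C = 145, 145, 200, 30               # illustrative IP-like sizes
x_orig = np.random.rand(H, W, B)             # stand-in for the HSI cube
x_pca = PCA(n_components=C).fit_transform(x_orig.reshape(-1, B))
x_pca = x_pca.reshape(H, W, C)               # X_PCA in R^{H x W x C}
print(x_pca.shape)                           # (145, 145, 30)
```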
The first pure 3-D convolution layer is utilized to receive the input data and obtain more concise and serviceable feature information. HSSAN blocks are employed to extract the spectral-spatial features of the hyperspectral data to the maximum extent, and AdaptiveAvgPool3d is utilized to obtain the expected size of the output features. The MLP acts as the output layer, which better handles the nonlinear features of HSI. The multiclass cross-entropy function is selected as the loss function for the experiments, whose mathematical expression is

$$\mathrm{Loss} = -\sum_{n=0}^{N-1} Y_n \log(P_n) \tag{8}$$

where $N$ is the number of land-cover classes, $P = [P_0, \ldots, P_{N-1}]$ with $P_n$ the probability that the sample belongs to class $n$, and $Y = [Y_0, \ldots, Y_{N-1}]$ is the one-hot representation of the sample label, i.e., $Y_n = 1$ when the sample belongs to class $n$ and $Y_n = 0$ otherwise.
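In PyTorch (an assumed framework choice), the multiclass cross-entropy loss described above is available as a single call that fuses the log-softmax with the one-hot sum, taking integer class labels directly:

```python
import torch
import torch.nn.functional as F

# Multiclass cross-entropy on a batch; sizes are illustrative assumptions.
logits = torch.randn(16, 9)             # batch of 16, 9 land-cover classes
labels = torch.randint(0, 9, (16,))     # ground-truth class indices
loss = F.cross_entropy(logits, labels)  # = -sum_n Y_n * log(P_n), averaged
print(float(loss) > 0)                  # True
```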

A. 3-D-RDAC and 3-D-MRDAC
The conventional n × n convolutional kernels can be replaced with 1 × n and n × 1 asymmetric convolutional kernels, which strengthens the kernel skeleton of the CNN while reducing parameters [52].
The residual connection reuses the input features, which promotes the classification performance of the CNN. The dense connection stacks the output features from each layer of the CNN along the channel dimension, which uses information from all convolutional layers adequately but may introduce large redundant information. Meng et al. [41] proposed the RDACN, which applies the dense connection only to the first layer of the CNN, combining the advantages of residual and dense connections while solving the problem of information redundancy. However, this method only uses the 2-DCNN, which ignores the spectral-spatial features of HSI.
As shown in Fig. 4, the 3-D-RDAC proposed in this article is based on RDACN; it richly extracts spectral-spatial features while combining the advantages of residual connections, dense connections, and asymmetric convolutional kernels. The output features of the (3×3×3) convolutional layer are duplicated and stacked with its input features, and the result serves as the input of the subsequent layers. These features are then successively passed through the asymmetric convolution layers of (1×1×3), (1×3×1), and (3×1×1) to learn the feature information adequately while reducing the network parameters. The (1×1×1) convolution layer enhances the expressiveness of the features. Finally, the input features are summed with the output features to obtain the final output.
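The parameter saving from the asymmetric decomposition is easy to verify numerically. The sketch below compares a full (3×3×3) kernel against the (1×1×3), (1×3×1), (3×1×1) stack; the 48-channel width is taken from the architecture tables, while using bias-free layers is an assumption made to keep the count to weights only.

```python
import torch.nn as nn

# Weight count: full 3x3x3 kernel vs. the asymmetric three-layer stack.
full = nn.Conv3d(48, 48, (3, 3, 3), bias=False)
asym = [nn.Conv3d(48, 48, k, bias=False)
        for k in [(1, 1, 3), (1, 3, 1), (3, 1, 1)]]
n_full = sum(p.numel() for p in full.parameters())             # 48*48*27
n_asym = sum(p.numel() for c in asym for p in c.parameters())  # 48*48*(3+3+3)
print(n_full, n_asym)    # 62208 20736 -- a 3x reduction in weights
```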
The CNN is quite sensitive to the scale of the convolution kernel, and there may be blind spots and unrecognizable areas in CNN receptive fields. At the same time, when the difficulty of the features increases, the number of CNN layers must also be increased to better extract feature information, which raises the model parameters. As shown in Fig. 5, multiscale convolution is introduced into 3-D-RDAC to form the 3-D-MRDAC, which improves the scale invariance of the CNN, avoids blind spots and unrecognizable areas in receptive fields, and reduces model parameters to elevate the efficiency of the model.

B. HSSAN Block
The attention mechanism can screen out the information most vital to the current target task from a mass of information [53]. Attention modules typically operate along the channel or spatial dimension, generating 1-D or 2-D weights and treating the neurons in each channel or spatial location equally. Channel attention (1-D attention) treats different channels discriminatively but treats all spatial locations equally; spatial attention (2-D attention) treats different spatial locations discriminatively but treats all channels equally. This may limit their ability to learn discerning cues. Therefore, the 3-D attention mechanism SimAM is used in the model in this article.
The proposed HSSAN block is shown in Fig. 6. The feature maps after concatenation are received by the SimAM attention, which deduces the 3-D weights required for HSIC. Then, the processed features are transmitted to the multiscale asymmetric convolutional kernels.

C. HSSAM for HSIC
Taking the LK dataset as an example, the detailed structural settings of the proposed HSSAM method and the HSSAN block are given in Tables I and II. As given in Table I, to make better use of spatial information and control the number of parameters, a spatial window size of (15×15) is selected as the model input. First, a pure 3-D convolution layer with a (3×3×3) kernel is set to guarantee that the network receives the input data successfully, and the number of its filters is set to 48. Second, two HSSAN blocks are set consecutively to learn the spectral-spatial features of the hyperspectral data to the maximum extent. Then, the BN-Mish-Conv3d_2 layer with kernel size (1×1×1, Counts of Classes) is set to ensure that the features are exported from the model smoothly. Subsequently, the BN-Mish-AdaptiveAvgPool3d layer is used to obtain the desired output size of (1×1×1, Counts of Classes). Finally, the MLP is adopted as the output layer to better deal with the nonlinear features of the hyperspectral data.
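The layer sequence just described can be sketched end to end. This is a rough shape-level mock-up only: the HSSAN block internals are replaced by plain Conv3d placeholders, batch normalization is omitted, and all sizes other than the 48 filters and the Mish activations are assumptions.

```python
import torch
import torch.nn as nn

# Shape-level sketch of the Table I pipeline; HSSAN internals are mocked.
num_classes = 9
model = nn.Sequential(
    nn.Conv3d(1, 48, 3, padding=1), nn.Mish(),   # first pure 3-D conv, 48 filters
    nn.Conv3d(48, 48, 3, padding=1), nn.Mish(),  # placeholder: HSSAN block 1
    nn.Conv3d(48, 48, 3, padding=1), nn.Mish(),  # placeholder: HSSAN block 2
    nn.Conv3d(48, num_classes, 1), nn.Mish(),    # export features, class channels
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),       # (1x1x1, Counts of Classes)
    nn.Linear(num_classes, num_classes),         # stand-in for the MLP head
)
out = model(torch.randn(2, 1, 10, 15, 15))       # 2 patches, 10 PCA bands
print(out.shape)                                 # torch.Size([2, 9])
```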
As given in Table II, the first layer is BN-Mish-Conv3d_1 with size (1×1×1, 48), and the input of the first layer is concatenated with two copies of its output to enhance the transmission of features. The second layer is the SimAM attention, which deduces the 3-D attention weights directly without increasing the number of parameters. Then come the multiscale asymmetric layers, whose detailed sizes are shown in the table (from BN-Mish-Conv3d_21 to BN-Mish-Conv3d_33). Finally, the BN-Mish-Conv3d_4 layer with (1×1×1, 48) reduces the number of filters and controls the number of parameters.

IV. EXPERIMENTS AND RESULTS
This section reports HSIC experiments on four famous HSI datasets and provides a detailed analysis of the classification results.

A. Hyperspectral Datasets
The performance of the proposed HSSAM method was evaluated on four public hyperspectral datasets: the Indian Pines (IP), Pavia University (PU), WHU-Hi-LongKou (LK), and WHU-Hi-HanChuan (HC) datasets. The information for all the datasets is given in Table III.

B. Experimental Setting
The related experiments in this article were performed on a 5.2-GHz Intel Core i9-12900K CPU, 32 GB of memory, and an Nvidia GeForce RTX 3090Ti graphics card with 24 GB of video memory. The window size, batch size, and number of epochs were set to 15×15, 16, and 300, respectively. The initial learning rate for Adam and the number of principal components chosen by PCA for the four datasets are given in Table VIII. To improve the credibility of the experimental results, 5%, 1%, 0.1%, and 0.5% of the IP, PU, LK, and HC datasets were randomly selected as training samples and validation samples, and the remaining samples were used for testing. All experiments were carried out five times, and the final classification results were averaged over the five experiments. The well-known overall accuracy (OA), average accuracy (AA), and the statistical kappa (K) coefficient were used as the key evaluation indicators.
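The random split can be sketched as below. The reading assumed here is that the stated percentage is drawn once for training and once for validation (5% each on IP); the pixel count is a stand-in, and the article's exact sampling procedure may differ.

```python
import numpy as np

# Sketch of a random train/val/test split over labeled pixels (IP-like count).
rng = np.random.default_rng(0)
n_total = 21025                          # stand-in for the labeled-pixel count
idx = rng.permutation(n_total)
n = int(0.05 * n_total)                  # 5% for IP; other datasets differ
train, val, test = idx[:n], idx[n:2 * n], idx[2 * n:]
print(len(train), len(val), len(test))   # 1051 1051 18923
```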
1) Selection of Learning Rate and PCA: It is necessary to experiment with different parameters to obtain the optimal accuracy. In this section, the learning rate for Adam and the number of principal components chosen by PCA were varied jointly on the four HSI datasets. The control variable method was adopted for each dataset; that is, the window size, epochs, number of experiments, and the numbers of training, validation, and test samples were kept consistent. Figs. 11-14 display the combined action of Adam and PCA on the four datasets, where each (a) shows the full presentation of every combination and each (b) shows the optimal combination of Adam and PCA. It is concluded that, to obtain optimal accuracy, the learning rates should be set to 0.001, 0.0005, 0.0005, and 0.0005, and the numbers of principal components should be set to 30, 10, 10, and 15 on the IP, PU, LK, and HC datasets, respectively.
2) Selection of Convolution Kernel Size: The LK dataset was used as an example to verify the effect of different kernel sizes. The control variable method was similarly adopted. Combining kernel sizes 7 and 3 achieves the best classification result. In addition, it can be seen that a convolution kernel of size 5 is not suitable for the model proposed in this article.

C. Classification Results
The proposed HSSAM method was compared with 3-DCNN [35], SSRN [39], S-DMM [56], SS-MLP [29], MSR-3DCNN [57], RDACN [41], DMCN [10], and AINET [58] to validate its classification performance. S-DMM adopted 50 000 epochs according to its original article, due to its slow convergence. Meanwhile, to reproduce the accuracy of the original article as much as possible, the selection of training samples for S-DMM followed the original article. Because the number of samples in some IP categories is very small, the required number of training samples could not be reached, so S-DMM was not applied to IP. The other methods maintained the same parameter settings as the algorithm in this article.
Results show that the proposed method can improve the accuracy of HSIC and has fewer parameters when the training samples are limited. Asymmetric convolution reduces the number of model parameters. The residual-dense connection and the multiscale convolution kernel richly extract feature information with fewer convolutional layers. SimAM derives the required 3-D weights of the features. The MLP better handles the nonlinear features of HSI. The classification results of the proposed method and all the comparison methods are given in Tables X-XIII. It can be seen that the proposed HSSAM method achieves almost the best results on the evaluation indicators, with OA reaching 96.84%, 98.85%, 98.01%, and 97.18%, AA reaching 93.40%, 98.38%, 96.21%, and 95.08%, and K reaching 96.40%, 98.48%, 97.37%, and 96.71% on the IP, PU, LK, and HC datasets, respectively.
Meanwhile, to fully evaluate the classification performance, Figs. 15-18 show the classification maps of all the comparison methods and the proposed method on HSI. It can be observed that the higher the classification accuracy, the fewer noise points in the corresponding classification maps. The proposed HSSAM method produces relatively fewer noise points than the comparison methods, demonstrating that it is effective in enhancing the classification performance of HSI. The experimental results on the different datasets are described in detail as follows.

D. Consumption and Computational Complexity
To comprehensively compare the proposed method with existing methods, experiments were conducted in this section on the total parameters, training time, test time, and FLOPs of all methods, and the results are given in Table XIV.
All experiments were performed once, the epochs of all methods were set to 300, and the other experimental settings were the same as previously mentioned. S-DMM was not used on the IP data, so its complexity on the IP data is not shown in the table. Meanwhile, S-DMM did not converge within 300 epochs; therefore, its OA value at 300 epochs is not displayed in the table. Although the proposed HSSAM method has a longer running time than MSR-3DCNN and DMCN, its total parameters are fewer and its OA is higher. Taking the LK dataset as an example, the total parameters are reduced by 28.24% and 70.23% relative to MSR-3DCNN and DMCN, respectively. Moreover, compared with 3-DCNN, SSRN, SS-MLP, and RDACN, although the HSSAM method has a longer running time and more total parameters, its OA is the highest. Finally, compared with AINET, the running time and total parameters are similar, while the OA of HSSAM is still optimal.
However, compared with the comparison methods, the HSSAM method proposed in this article has large computational complexity, which is a shortcoming to be improved in future research.
As shown in Fig. 19, taking 20, 50, 100, and 300 epochs as examples, convergence experiments for the different algorithms were carried out on the four HSI datasets. 3-DCNN, SSRN, MSR-3DCNN, and DMCN converged rapidly on the four HSI datasets, approaching their optimal accuracy at 20 epochs, and their accuracy changed little as the epochs increased. SS-MLP and AINET could converge at 20 epochs, and their accuracies improved slightly with increasing epochs, but their OA is not high. After RDACN converged at 20 epochs, there was no significant change in accuracy on the IP and HC datasets as the epochs increased; however, there was a slight improvement on the PU and LK datasets. S-DMM did not converge at all within 300 epochs. The method proposed in this article could converge at 20 epochs, and with increasing epochs, its classification accuracy was greatly improved, finally surpassing all comparison methods.

V. DISCUSSION

A. Ablation Analysis
In this section, the LK dataset was again used as an example for ablation experiments verifying the effectiveness of the different components; the classification results are given in Table XV. When the proposed method uses only kernel size 3, its OA is only 94.07%; with only kernel size 7, its OA is 96.37%; and when kernel sizes 3 and 7 are combined, the OA is 96.67%, which proves that the multiscale kernel can slightly improve the classification accuracy of the proposed model. When the model combines kernel sizes 3 and 7 with SimAM or with the MLP, the OAs are 96.97% and 97.06%, respectively, proving that SimAM and the MLP are each helpful on top of the multiscale kernel. When all four components (kernel size 3, kernel size 7, SimAM, and the MLP) are used, the classification performance on the LK dataset is optimal, which further proves the effectiveness of the proposed HSSAM method in improving the classification performance on HSI.
As given in Table XVI, the effect of the 3-D asymmetric spectral-spatial layers was verified experimentally. It can be seen that the parameters, running time, and FLOPs of the proposed HSSAM method are greatly reduced by the 3-D asymmetric spectral-spatial layers. Meanwhile, the values of OA, AA, and K for the proposed model are significantly improved.

B. Comparison of Training Percentage
In this section, experiments were adopted to analyze the performance of the proposed HSSAM method with limited training samples. The experimental setting was the same as previously mentioned. Fig. 20 shows the classification results for different training percentages.
For the IP dataset, 1%, 3%, 5%, and 10% of the total samples were selected as training samples, and for the PU dataset, 0.5%, 1%, 3%, and 5% were selected. For the LK and HC datasets, since the total samples are plentiful, only 0.05%, 0.1%, 0.5%, and 1% of the total samples were used as the training set. Experiments show that HSSAM is superior to the other methods in almost all cases, in particular when the training samples are insufficient. In addition, the proposed HSSAM method performs stably on the four HSI datasets.

C. Comparison of Window Size
Fig. 21 shows the classification performance of the proposed HSSAM method with different spatial window sizes (i.e., 7×7, 9×9, 11×11, 13×13, and 15×15) on the four datasets.With the increase in spatial window size, the classification accuracy of the proposed method continues to improve, and the performance of the proposed method is always better than all the comparison methods.

VI. CONCLUSION
This article has designed a new model for HSIC with a plug-and-play heterogeneous spectral-spatial attention (HSSAN) block. In this approach, PCA is first employed for dimensionality reduction to reduce the number of bands and the spectral redundancy, thereby reducing the number of parameters and the running time of the model. The HSSAN block contains the 3-D-MRDAC and SimAM attention. Asymmetric convolution reduces the number of model parameters. The residual-dense connection and the multiscale convolution kernel enable the model to fully extract feature information with fewer convolutional layers. SimAM derives the 3-D weights required for HSIC. In view of the nonlinear characteristics of HSI, the MLP is selected to extract the output feature information and capture long-range dependencies. To sum up, the proposed method can improve the accuracy of HSIC and has fewer parameters when the training samples are limited. Compared with some models, this model achieves higher precision but has a longer running time and greater complexity. Therefore, further work will continuously optimize the network structure to achieve a more lightweight model with higher accuracy.

Fig. 7. Image cube and ground-truth image of IP dataset.

Fig. 11. Combined action of Adam and PCA for IP dataset. (a) Full presentation of all combinations. (b) Highlight of the optimal combination.

Fig. 12. Combined action of Adam and PCA for PU dataset. (a) Full presentation of all combinations. (b) Highlight of the optimal combination.

Fig. 13. Combined action of Adam and PCA for LK dataset. (a) Full presentation of all combinations. (b) Highlight of the optimal combination.

Fig. 14. Combined action of Adam and PCA for HC dataset. (a) Full presentation of all combinations. (b) Highlight of the optimal combination.

TABLE IV INFORMATION OF IP DATASET

TABLE V INFORMATION OF PU DATASET

TABLE VI INFORMATION OF LK DATASET

TABLE VIII EXPERIMENTAL PARAMETERS FOR EACH DATASET

TABLE IX PERFORMANCE OF DIFFERENT KERNEL SIZES

TABLE X CLASSIFICATION RESULTS OF ALL METHODS WITH 5% TRAINING SAMPLES ON IP DATASET

TABLE XI CLASSIFICATION RESULTS OF ALL METHODS WITH 1% TRAINING SAMPLES ON PU DATASET

As given in Table XIII, SS-MLP is the worst performer with 0.5% training samples, and AINET is the second worst. The proposed method in this article increased by 7.36%, 2.42%, 1.85%, 12.47%, 3.24%,

TABLE XII CLASSIFICATION RESULTS OF ALL METHODS WITH 0.1% TRAINING SAMPLES ON LK DATASET

TABLE XIII CLASSIFICATION RESULTS OF ALL METHODS WITH 0.5% TRAINING SAMPLES ON HC DATASET

TABLE XIV CONSUMPTION AND COMPUTATIONAL COMPLEXITY OF EACH DATASET

TABLE XVI PERFORMANCE OF ASYMMETRIC SPECTRAL-SPATIAL LAYERS ON LK DATASET