Hyperspectral Image Classification Based on 3-D Multihead Self-Attention Spectral–Spatial Feature Fusion Network

Convolutional neural networks are a popular method in hyperspectral image classification. However, model accuracy is closely related to the number and spatial size of training samples. To relieve the performance decline caused by the limited number and spatial size of training samples, we designed a 3-D multihead self-attention spectral–spatial feature fusion network (3DMHSA-SSFFN) that contains step-by-step feature extraction (SBSFE) blocks and a 3-D multihead self-attention (3DMHSA) module. The step-by-step feature extraction blocks relieve the accuracy decline caused by the limited number of training samples, and their multiscale convolution kernels extract richer spatial–spectral features. In hyperspectral image classification, the 3DMHSA module enhances classification stability by correlating disparate features. Experimental results on three datasets show that 3DMHSA-SSFFN possesses better classification performance than other advanced models using limited numbers of balanced and imbalanced training samples.


I. INTRODUCTION
HYPERSPECTRAL image technology combines nanometer-level spectral resolution with submeter-level spatial resolution, compared with RGB and multispectral image technology. Rich spatial and spectral information enables the technology to distinguish ground objects. At the same time, hyperspectral image technology is widely used in mineral component identification, vegetation identification, atmospheric component analysis, soil investigation, and urban planning [1], [2], [3], [4], [5], [6]. The essence of hyperspectral image technology application is hyperspectral remote image classification: efficiently and accurately discriminating each pixel of a hyperspectral image is the core of the classification problem.
In recent years, based on pattern recognition, machine learning, and deep learning, many researchers have further mined the hidden features of the target ground by combining spatial and spectral characteristics. These methods effectively alleviate the impact of the "dimension disaster" [7] on classification accuracy. According to whether labeled training samples are used [8], [9], hyperspectral classification methods are divided into supervised and unsupervised methods.
Unsupervised methods do not use any prior knowledge in the classification process; instead, they exploit the statistical characteristics of the samples to be classified. For example, K-means [10], [11] and spectral angle mapping [12], [13] establish connections between spectral differences and parameters such as variance and spectral angle. These methods reduce the complexity of hyperspectral classification. However, unsupervised classifiers only use shallow spectral features, whereas more complex mapping relations can extract deeper features of hyperspectral images. The method in [14] used greedy hierarchical unsupervised pretraining combined with an efficient unsupervised learning algorithm for sparse features. Overall, unsupervised methods achieve lower classification accuracy than supervised methods.
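As a concrete illustration of the spectral-angle idea mentioned above, the following is a toy numpy sketch (not code from the cited works); the reference spectra and pixel values are invented for the example.

```python
import numpy as np

def spectral_angle(pixel, reference):
    """Angle (radians) between a pixel spectrum and a reference spectrum.

    Smaller angles mean more similar spectral shapes; the measure is
    insensitive to overall illumination scaling.
    """
    cos = np.dot(pixel, reference) / (np.linalg.norm(pixel) * np.linalg.norm(reference))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sam_classify(pixels, references):
    """Assign each pixel to the reference spectrum with the smallest angle."""
    angles = np.array([[spectral_angle(p, r) for r in references] for p in pixels])
    return np.argmin(angles, axis=1)

# Toy example: two reference spectra and a scaled copy of each.
refs = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
pix = np.array([[2.0, 4.0, 6.0],   # same shape as refs[0], brighter
                [1.5, 1.0, 0.5]])  # same shape as refs[1], darker
print(sam_classify(pix, refs))     # -> [0 1]
```

Because only the angle between spectra matters, a pixel that is a brighter or darker multiple of a reference spectrum is still assigned to that reference.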
Supervised methods use the prior knowledge of training samples to improve classification accuracy and are divided into traditional and deep learning methods. Various traditional supervised learning methods have been widely used in hyperspectral image classification, such as support vector machines (SVMs) [15], [16], neural networks [17], [18], random subspace methods [19], [20], logistic regression [21], [22], and decision trees [23], [24]. These supervised methods provide numerous solutions for hyperspectral image classification. However, under a limited number of training samples, they do not exploit the rich spatial and spectral information of hyperspectral images to improve classification accuracy.
Supervised deep learning methods significantly improve the classification accuracy of hyperspectral images by excavating useful features of the target ground, connecting training samples with labels through nonlinear mapping. Many types of networks have been applied to hyperspectral image classification, for example, deep belief networks [25], [26], autoencoders [27], [28], convolutional neural networks (CNNs) [29], [30], [31], generative adversarial networks [32], [33], [34], and recurrent neural networks [35], [36], [37]. Regarding autoencoders, Chen [38] introduced the concept of the stacked autoencoder into hyperspectral image classification for the first time to realize the fusion classification of spatial-spectral features. Ma [39] strengthened the robustness of the autoencoding network from the point of 2-D spatial information using limited training samples. In terms of deep belief networks, Chen [40] introduced the deep belief network into hyperspectral image classification for the first time and proposed a new deep learning framework for feature extraction and classification based on principal component analysis (PCA), deep learning, and logistic regression. Chen [41] proposed a deep belief network based on a conjugate gradient updating algorithm, which provided a new method for hyperspectral classification through unsupervised training and supervised fine-tuning of spectral data. Regarding CNN models, Hu [42] regarded only spectral information as training samples, establishing the connection between spectral information and label information using a 1-D convolution layer. Compared with traditional classification methods, this network improved classification accuracy to a certain extent, but the single spectral feature limited model performance.
Therefore, many researchers [42], [43], [44], [45], [46] avoided limiting model performance to a single feature by using the PCA method to extract spectral features and 2-D convolution kernels to extract spatial features from them. These convolution networks realize hyperspectral image classification by fusing spatial and spectral features. Although this combination of spectral dimension reduction and spatial classification avoids the influence of a single feature on model performance to a certain extent, dimension reduction loses part of the spectral information, and under a limited number of training samples these methods lack spatial-spectral feature extraction ability. A 3-D convolution network handles spatial and spectral information jointly to extract spatial-spectral features, connecting them with the target label through nonlinear mapping. In feature extraction, spectral-spatial models extract features of hyperspectral data from both the spectral and spatial dimensions, avoiding the degradation of model performance caused by a single feature or lack of information. Zhong [47] designed an end-to-end spectral-spatial residual network, which uses original data as model input to learn discriminative features from spectral and spatial residual blocks. Wang [48] proposed an end-to-end fast and dense spatial-spectral convolution framework to extract spectral and spatial features in HSI through dense spectral and spatial blocks. Lu [49] proposed a multiscale spatial-spectral residual network based on 3-D channel and spatial attention, which uses three parallel residual layers with different 3-D convolution kernels to extract spatial-spectral features.
Model performance inevitably meets a bottleneck in complex scenes that require fine classification. Hong [54] provided a general multimodal deep learning (MDL) framework that applies to pixel-wise classification tasks and spatial information models. Xu [55] found that conventional convolution kernels cannot process rich spatial information effectively, so a spectral-spatial residual graph attention network was proposed, including spectral residual and graph attention convolution modules. Hong [56] proposed the SpectralFormer network to enhance the representation of the sequence attributes of spectral signatures, which learns spectrally local sequence information from neighboring bands. Because the labeled samples available for training are extremely limited, Pu [57] proposed semisupervised spatial-spectral dual-path networks, which improve classification results by exploring fusion features to reconstruct supervised and unsupervised features in turn. Local semantic change is an important factor in identifying the same materials, so Hong [58] extracts invariant features in both the spatial and frequency domains. From the above discussion, we can draw two crucial conclusions. First, spatial and spectral information can enhance the accuracy of HSI classification from the point of view of information classes. Second, a larger spatial size and a larger number of training samples are key to enhancing the classification performance of these methods. In other words, when CNN models adopt a small number of samples with a small spatial size during the training phase, classification performance declines.
Recently, transformers, which have achieved great success in natural language processing, have come into use in hyperspectral image classification. Zhang [59] proposed a convolution transformer mixer (CTMixer) network combining the advantages of the vision transformer and the CNN, which also uses the MHSA mechanism to improve classification accuracy. Mei [60] built a multihead self-attention that encodes the semantic context-aware representation to obtain discriminative features. He [61] proposed the HSI-BERT model, which has good generalization ability using the MHSA layer. However, these methods use linear and convolution layers to extract discriminative features, which ignore tiny local features. Pu [64] found that inhomogeneous pixels and inherent spectral correlation influence the classification performance of CNNs, so the MFNSAM was proposed to improve the representational capacity in the spatial and spectral domains by exploiting a channel-wise attention branch. Pu [65] proposed the AATN to effectively extract joint spatial-spectral features from a small training set, which takes advantage of the split-transform-merge strategy to change the number of cardinalities and improve generalization ability.
To relieve the performance decline caused by the number and spatial size of training samples, a 3-D multihead self-attention spectral-spatial feature fusion network (3DMHSA-SSFFN) is proposed. This network realizes high-precision HSI classification using a limited number of training samples. The network comprises step-by-step feature extraction (SBSFE) blocks and a 3-D multihead self-attention (3DMHSA) module. In the step-by-step feature extraction blocks, a step-by-step strategy is adopted to extract spatial-spectral features from the input hyperspectral image cube, realized through the multiscale characteristics of the convolution kernels. Compared with the SSRN, the 3DMHSA-SSFFN is lighter in structure. In the 3DMHSA module, we use complex calculations to enhance the correlation between spatial and spectral features. This article investigates the availability and advancement of 3DMHSA-SSFFN feature learning classification through different comparative experiments.
The three significant contributions of this article are listed as follows: 1) The designed network adopts step-by-step feature extraction blocks, implemented through the multiscale characteristics of the convolution kernels, to relieve the performance decline caused by the number and spatial size of training samples. 2) This article validates the 3DMHSA module using pointwise convolution as an effective method to improve classification stability under a limited number of training samples.
3) The combination of step-by-step blocks and the 3DMHSA module has good universality on three commonly used hyperspectral datasets. The proposed network uses limited training data with fixed spatial size to achieve state-of-the-art classification accuracy. The remainder of this article is organized as follows: Section II describes the 3DMHSA-SSFFN network in detail. Section III verifies the proposed model through experiments against other advanced HSI classification methods. Finally, Section IV summarizes this article and suggests future research.

A. Step-By-Step Feature Extracted Blocks
CNNs have achieved state-of-the-art results in hyperspectral image classification. Numerous research results show that increasing network depth can improve classification accuracy, but excessively increasing the number of feature extraction layers leads to a decrease in classification accuracy. This phenomenon is caused by the deeper construction of the network and limited training samples. Many classification models alleviate the decreasing-accuracy issue by optimizing the residual structure between feature layers and simplifying the structure of the convolution layers. Therefore, exploiting the correlation of different convolution scales between spatial and spectral features, step-by-step feature extraction blocks are designed to extract multiscale spatial-spectral features from 3-D HSI cube data. The step-by-step feature extraction block makes gradients propagate steadily at different levels, thus improving the quality of training and speeding up the training process. Table I lists the mathematical notations used in this article.
In the first part of the step-by-step feature extraction blocks, as shown in Fig. 1, convolution kernels of size 9 × 1 × 1 are used in the spectral filter bank G^{g=1}_{m+1} of the (m+1)th layer. At the same time, the spatial-spectral size of the 3-D feature cube X_{m+1} is changed through a stride strategy, which means that the output feature cube X_{m+1} is compressed in the spectral dimension after the convolution operation. Then, two convolution kernels of size 9 × 3 × 3 are used in the spectral-spatial filters G^{g=2}_{m+2} and G^{g=2}_{m+3} of the (m+2)th and (m+3)th layers, respectively. Significantly, the feature cubes X_{m+2} and X_{m+3} are kept unchanged in size through a padding strategy. The block's output map combines the cubes X_{m+3} with the cubes X_{m+2} through a skip connection. In the first block, the convolution kernels G^{g=1}_{m+1} and G^{g=2}_{m+i} (i = 2, 3) differ in the parameter g: G^{g=1}_{m+1} compresses redundant spectral information, while G^{g=2}_{m+i} (i = 2, 3) extracts spatial-spectral features.
In the second part of the step-by-step feature extraction blocks, as shown in Fig. 2, the focus is placed on the feature extraction strategy using two types of convolution kernels in the successive filter banks G^{g=2}_{n+1} and G^{g=2}_{n+i} (i = 2, ..., 7). A convolution kernel of size P × 1 × 1 is used to compress the input feature cube X_{n+1} in the spectral dimension. Convolution kernels of size 1 × 3 × 3 are used in the spatial filters G^{g=2}_{n+i} (i = 2, ..., 7) with a padding strategy. The feature cubes X_{n+i} (i = 2, ..., 7) are connected by skip connections. In the second block, X_{n+i} ∈ R^{C×S×H×W} represents the output feature of the (n+i)th layer, and the convolution kernels G^{g=2}_{n+i} (i = 1, 2, ..., 7) of each layer comprise g = 2 convolution kernels through channel matching. The convolution kernel G^{g=2}_{n+1} with bias b_{n+1} is a special 3-D convolution kernel that only considers the correlation of spectral features and ignores the correlation of spatial-spectral features.
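The shape bookkeeping behind these blocks can be sketched along the spectral axis alone. The following is a simplified 1-D numpy illustration under stated assumptions (the averaging kernel is a stand-in for the learned filters, and the 103-band input follows the network description later in this article); it is not the actual implementation.

```python
import numpy as np

def conv1d_spectral(x, kernel, stride=1, pad=0):
    """Valid 1-D convolution along the spectral axis with stride and padding."""
    x = np.pad(x, pad)
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(0, len(x) - k + 1, stride)])

spectrum = np.arange(103, dtype=float)      # one pixel's 103-band spectrum
kernel9 = np.ones(9) / 9                    # stand-in for a 9 x 1 x 1 kernel
# Stride 2 compresses the spectral dimension: 103 bands -> 48.
compressed = conv1d_spectral(spectrum, kernel9, stride=2)
print(compressed.shape)                     # (48,)

# With padding 4 the length is preserved, so a skip connection can add the
# layer input to its output elementwise, as in X_{m+2} + X_{m+3}.
same = conv1d_spectral(compressed, kernel9, stride=1, pad=4)
residual = compressed + same
assert residual.shape == compressed.shape
```

The same arithmetic, (length + 2*pad - kernel) // stride + 1, governs each axis of the 3-D kernels described above, which is why the padded 9 × 3 × 3 layers leave the cube size unchanged while the strided 9 × 1 × 1 layer halves the spectral dimension.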

B. 3DMHSA Module
Considering that HSI data contain spectral and spatial information, we propose a cascade module for spectral and spatial feature extraction and classification in HSI networks. As shown in Fig. 3, the module comprises a feature extraction learning part and a multidimensional parameter part. Compared with traditional convolution, the 3DMHSA cascade module outputs a deep feature map through complex calculation over multidimensional parameters and extracted features, reducing the standard deviation of model classification accuracy.
In the feature extraction learning part, as shown in Fig. 3, three convolution kernels W^q_j, W^k_j, and W^v_j extract the query q, key k, and value v from the input X of the module, where q, k, v ∈ R^{p×C/p}. Notably, these pointwise convolution kernels use a padding strategy. We use p heads in the 3DMHSA module, where the multidimensional parameter matrices consist of R_h, R_w, and R_s: R_h, R_w ∈ R^{p×C/p} denote the learned spatial-dimension parameters, and R_s ∈ R^{p×C/p} denotes the learned spectral-dimension parameter. The output of the 3DMHSA module is a weighted combination of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. The module makes the best of the differences among content, position, and spectrum information to improve the stability of the model.
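A minimal numpy sketch of the module's attention computation follows, with the feature cube flattened to a sequence of positions and a single positional matrix R standing in for the combined R_h, R_w, and R_s parameters. This is an illustrative simplification under those assumptions, not the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_3d(X, Wq, Wk, Wv, R, heads):
    """Simplified multihead self-attention over a flattened feature cube.

    X : (N, C) array of N spatial-spectral positions with C channels.
    Wq, Wk, Wv : (C, C) pointwise (1 x 1 x 1) projection matrices.
    R : (N, C) learned positional term (stand-in for R_h, R_w, R_s).
    """
    N, C = X.shape
    d = C // heads
    q = (X @ Wq).reshape(N, heads, d)
    k = (X @ Wk).reshape(N, heads, d)
    v = (X @ Wv).reshape(N, heads, d)
    r = R.reshape(N, heads, d)
    # Content-content plus content-position compatibility scores, per head.
    scores = np.einsum('nhd,mhd->hnm', q, k) + np.einsum('nhd,mhd->hnm', q, r)
    attn = softmax(scores / np.sqrt(d), axis=-1)
    # Weighted combination of the values, merged back across heads.
    out = np.einsum('hnm,mhd->nhd', attn, v).reshape(N, C)
    return out

rng = np.random.default_rng(0)
N, C, heads = 16, 8, 2
X = rng.normal(size=(N, C))
out = mhsa_3d(X, *(rng.normal(size=(C, C)) for _ in range(3)),
              rng.normal(size=(N, C)), heads)
print(out.shape)  # (16, 8)
```

The positional score term is what distinguishes this attention from a plain content-only variant: each query is compared against both the keys and the learned position/spectrum parameters before the softmax weighting is applied to the values.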

C. 3DMHSA-SSFFN Network
In constructing the 3DMHSA-based spectral-spatial feature fusion network, grouped convolution layers and the 3DMHSA cascade module are used extensively. As shown in Fig. 4, the 3DMHSA-SSFFN includes step-by-step feature extraction blocks, an average pooling layer, and a fully connected (FC) layer. The 3DMHSA-SSFFN alleviates the decreasing-accuracy phenomenon by using multiscale convolution kernels to formulate the spatial-spectral features of the receptive field, and improves classification stability by using the 3DMHSA module to extract spatial-spectral features. The model's input spectral size is limited to 103 bands by PCA analysis; the spatial size of the model input is not limited. We take the SV dataset, whose 3-D input has the size 103 × 9 × 9, as an example to explain the designed 3DMHSA-SSFFN.
The first block contains two types of convolution layers. In the first kind of convolution layer, 24 spectral kernels of size 9 × 1 × 1 with a stride of (2, 1, 1) convolve the model input to reduce the impact of redundant spectral information. This layer builds a dimensionality-reduction mapping between the input cubes and the output features. In the second kind, the spatial-spectral convolution layers include a skip connection and use 24 convolution kernels of size 9 × 3 × 3 with a padding strategy of (4, 1, 1) to learn spatial-spectral features of size 24 × 48 × 9 × 9.
The second block extracts spatial-spectral features using two types of convolution layers, in which the spectral kernels and spatial kernels analyze the features from their corresponding angles. In the spectral feature learning part, 12 spectral kernels generate a feature mapping of the spectral dimension. In the spatial feature learning part, five spatial convolution layers with 12 spatial kernels of size 1 × 3 × 3 extract feature cubes of the same size through a padding strategy of (0, 1, 1). Significantly, the 3DMHSA module, located after the third spatial convolution layer, uses complicated calculations to generate a feature cube.
In the average pooling layer, 12 spatial-spectral feature cubes of size 4 × 9 × 9 are transformed into a 1-D feature vector of size 48 × 1 × 1. Then, a mapping is built between the class probability vector and the 1-D feature vector in an FC layer. In total, 3DMHSA-SSFFN uses four sizes of convolution layers and a 3DMHSA module to achieve hyperspectral image classification.
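The pooling and FC mapping at the end of the network can be sketched as follows. This is a numpy illustration with random stand-in values; the 12 cubes of size 4 × 9 × 9 follow the description above, and the 16-class output assumes the SV dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(12, 4, 9, 9))   # 12 cubes of size 4 x 9 x 9

# Average pooling over the 9 x 9 spatial plane, then flatten:
# 12 channels * 4 spectral positions = a 48-dimensional feature vector.
vector = features.mean(axis=(2, 3)).reshape(-1)
assert vector.shape == (48,)

# FC layer maps the 48-D vector to class probabilities (16 SV classes).
W, b = rng.normal(size=(16, 48)), np.zeros(16)
logits = W @ vector + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (16,)
```

Pooling over the full spatial plane is what makes the FC input size independent of the input patch size, consistent with the statement that the spatial size of the model input is not limited.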

A. Datasets Description
Three well-known experimental datasets were used: the University of Pavia (UP), the Salinas Valley (SV), and the Kennedy Space Center (KSC).
1) The UP dataset (see Table II) was captured by the ROSIS sensor in northern Italy. This dataset has a spatial resolution of 1.3 m per pixel and a spectral range of 430-860 nm. It contains 103 spectral bands, each with 610 × 340 spatial pixels. The dataset contains nine classes: asphalt, meadows, gravel, trees, painted metal sheets, bare soil, bitumen, self-blocking bricks, and shadows. In particular, the dataset used in this article has the noise bands removed. 2) The SV dataset (see Table III

B. Experimental Setup
After designing the 3DMHSA-SSFFN structure, we use the cross-entropy cost function to update the parameters of the proposed model. The training setup directly influences the classification result of the trained 3DMHSA-SSFFN; it covers data preprocessing, batch size, learning rate, optimizer, training epochs, and model initialization. In data preprocessing, we standardize the data to enhance the comparability of the characteristics of the classification objects, and the PCA method is used to relieve the influence of redundant spectral information. The total training epochs are set to 270, 270, and 800 for the UP, SV, and KSC datasets, respectively. The Adam optimizer is adopted to optimize the 3DMHSA-SSFFN parameters. The learning rate and number of heads affect the convergence and classification performance of the model. Therefore, we search the learning rate array {0.01, 0.009, 0.003, 0.001, 0.0009, 0.0003, 0.0001} and the head array {2, 6, 12}. As shown in Fig. 5, the optimum learning rate and head number for the three datasets are 0.001 and 12, respectively. The proposed network uses the He initialization method to initialize the weights of the convolution layers. The hardware environment includes an Intel Core i7-9700K processor, 64 GB DDR4 RAM, and an Nvidia GTX 1080Ti with 11 GB video memory. In addition, the software environment consists of Ubuntu 18.04.1, CUDA 10, cuDNN 7.6.5, and Python 3.7. All experiments were carried out in the same environment, and each experiment was repeated ten times.
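The preprocessing step, per-band standardization followed by PCA over the spectral bands, can be sketched as below. This is an illustrative numpy version under the assumption that PCA is applied to the pixels-by-bands matrix; it is not the exact pipeline from the article.

```python
import numpy as np

def standardize_and_pca(cube, n_components):
    """Per-band standardization followed by PCA along the spectral axis.

    cube : (H, W, B) hyperspectral image.
    Returns an (H, W, n_components) cube of principal-component scores.
    """
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    # Standardize each band to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    # PCA via SVD of the standardized data; rows of Vt are components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (X @ Vt[:n_components].T).reshape(H, W, n_components)

rng = np.random.default_rng(0)
cube = rng.normal(size=(10, 10, 20))     # toy 10 x 10 image with 20 bands
reduced = standardize_and_pca(cube, 5)
print(reduced.shape)  # (10, 10, 5)
```

Standardizing first keeps bands with large raw magnitudes from dominating the principal components, which matches the stated goal of making the characteristics of classification objects comparable before reducing spectral redundancy.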
The experiments include five parts. First, we use 20 training samples per class to analyze the performance of 3DMHSA-SSFFN in contrast with spatial, spectral, and spatial-spectral methods. Second, we compare with other advanced methods in classification accuracy by changing the spatial size of the input cubes. Third, we analyze the influence of the number of training samples on model performance. Fourth, we demonstrate that the 3DMHSA module enhances classification stability. Fifth, we compare the classification models in training time, test time, computational complexity, and space complexity.
To certify the availability of the proposed network, we use 20 training samples per class for the UP, SV, and KSC datasets as the training set. As shown in Tables V-VII, the 3DMHSA-SSFFN achieved the highest classification accuracy and a lower standard deviation than the other models in terms of OA, AA, kappa coefficient, and per-class accuracies. Compared with the other methods, the 3DMHSA-SSFFN reached 95.7%, 96.94%, and 96.14% in OA, AA, and kappa coefficient, respectively. These outcomes show that the spatial-spectral methods performed better than the other methods because they exploit spatial-spectral information; at the same time, the spatial methods generated better outcomes than the spectral methods.
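The reported metrics (OA, AA, and the kappa coefficient) follow their usual definitions from a confusion matrix. As a self-contained numpy sketch on a toy two-class confusion matrix (the numbers are invented for illustration):

```python
import numpy as np

def oa_aa_kappa(conf):
    """Overall accuracy, average accuracy, and Cohen's kappa.

    conf : confusion matrix with rows = true classes, columns = predictions.
    """
    n = conf.sum()
    oa = np.trace(conf) / n                              # fraction correct
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))       # mean per-class accuracy
    # Expected agreement by chance from the row/column marginals.
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

conf = np.array([[45, 5],
                 [10, 40]])
oa, aa, kappa = oa_aa_kappa(conf)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.85 0.85 0.7
```

Kappa discounts agreement that would occur by chance given the class marginals, which is why it is reported alongside OA and AA for imbalanced training sets.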
Furthermore, the 3DMHSA-SSFFN performed better because its spatial-spectral blocks learn richer spatial-spectral representations. Although each dataset has only a few training samples of the same number per class, the 3DMHSA-SSFFN achieved better than 95% mean classification accuracy. The results show that the spatial-spectral blocks are effective in the proposed network.
Figs. 6-8 display the ground-truth maps and the corresponding classification results of the trained models on the UP, SV, and KSC datasets, respectively. Although the spatial, spectral, and spatial-spectral methods generated classification maps with considerable noise, the spatial-spectral methods produced results with less noise in some classes. Remarkably, the 3DMHSA-SSFFN provided the most precise and clean classification maps through the feature learning representation of its spatial-spectral blocks.
To analyze the advancement of the proposed 3DMHSA-SSFFN for different spatial sizes of training samples, samples of spatial size 9 × 9 and 11 × 11 were chosen for the UP and SV datasets. DFFN uses training samples of spatial size 9 × 9, 11 × 11, and 25 × 25 on the UP and SV datasets. As shown in Table VIII, the spatial size of the input HSI data impacts the classification accuracy of 3DMHSA-SSFFN. 3DMHSA-SSFFN still produced high classification accuracy on the same dataset for the same spatial size of training samples because it extracts unique spatial-spectral features, unlike SSRN and MSRN_B. Compared with the SSRN result, 3DMHSA-SSFFN presents better classification robustness on the UP and SV datasets. Compared with the DFFN network, 3DMHSA-SSFFN shows the same classification performance on the SV dataset, while there is a huge difference in classification accuracy between 3DMHSA-SSFFN and DFFN on the UP dataset. Importantly, the performance reduction of the DFFN network caused by the spatial size of the training samples is more obvious.
To certify the robustness and generalizability of the 3DMHSA-SSFFN for different numbers of training samples, 3%, 5%, 7%, and 10% labeled samples were chosen for the UP, KSC, and SV datasets, with a spatial size of 9 × 9. As shown in Table XI, the proposed network still delivers the best classification performance on the three HSI datasets. In particular, the improvements are clear between 3% and 7% labeled samples on the KSC dataset, and the OAs exceed 99% on the UP and SV datasets. The stability of the classification results is positively correlated with the number of training samples across the three datasets. We also investigate the performance (aRMSEs) of the different algorithms on simulated data with mixed Gaussian distribution noise. The means and variances of the Gaussian distributions were randomly selected from 0 to 0.08. It is clear in Fig. 9 that the robustness of the SSFFNs is better than that of the other comparative algorithms. Above all, the 3DMHSA-SSFFN achieves higher classification accuracy while maintaining a lower aRMSE.
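The noise experiment described above might be sketched as follows. This is an illustrative numpy version; the specific mixture construction (averaging several Gaussian components with means and standard deviations drawn from [0, 0.08]) is an assumption about the protocol, not the article's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_mixed_gaussian_noise(cube, n_components=3, max_mean=0.08, max_std=0.08):
    """Perturb a data cube with a mixture of Gaussian noise components.

    Each component's mean and standard deviation are drawn uniformly from
    [0, 0.08], matching the range stated in the text (assumed mixture form).
    """
    noisy = cube.copy()
    for _ in range(n_components):
        mu = rng.uniform(0, max_mean)
        sigma = rng.uniform(0, max_std)
        noisy += rng.normal(mu, sigma, size=cube.shape) / n_components
    return noisy

def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

cube = rng.normal(size=(8, 8, 10))     # toy simulated data
noisy = add_mixed_gaussian_noise(cube)
print(rmse(cube, noisy))
```

Averaging the per-run RMSE over repeated noise draws would give an aRMSE-style robustness score for comparing models under the same perturbation.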
To illustrate the availability and effectiveness of the 3DMHSA module in improving classification stability, we constructed 2DMHSA-SSFFN and SSFFN networks, where 2DMHSA-SSFFN replaces the 3DMHSA module with a 2DMHSA module. We use 20 training samples per class on the three datasets. As shown in Table IX, the improvement brought by the 3DMHSA module is clear: the 2DMHSA-SSFFN performs worse than the 3DMHSA-SSFFN and the SSFFN in terms of standard deviation. Although the classification accuracy of SSFFN is the highest among these variants, the stability of its classification is lower than that of the other models. The 3DMHSA-SSFFN provides the best result from the combined point of view of classification accuracy and stability. Table X lists the training time, test time, computational complexity, and space complexity of the proposed method and the other methods on the three datasets. As reported in Table X, CNN3D and CNN2D need more training time to learn category features, yet their classification accuracy is lower than that of the SSFFNs. Compared with SSRN, the proposed networks take longer training time and higher computational complexity, but they achieve significantly better classification results with fewer parameters. Note that the 3DMHSA needs more training time than the 2DMHSA and ordinary convolution, but it provides a more stable classification performance.

D. Discussion
There are two differences between 3DMHSA-SSFFN and other classification models. First, the 3DMHSA-SSFFN adopts different spatial-spectral feature extraction strategies to improve classification accuracy and reduce training time. Second, the 3DMHSA module uses complicated data processing, which means that more discriminative features can be extracted.
The training set and the model construction limit the HSI classification ability of supervised models. The spatial size and number of training samples are vital influencing factors, and imbalance in the training samples also impacts model classification performance. Consequently, ensuring the consistency of training sets is necessary for a fair comparison with other models.

IV. CONCLUSION
This article proposes a 3-D multihead self-attention spectral-spatial feature fusion network for HSI classification and spatial-spectral feature extraction. The 3DMHSA-SSFFN, comprising step-by-step feature extraction blocks and the 3DMHSA module, relieves the decreasing-accuracy phenomenon and improves classification stability. The results prove that 3DMHSA-SSFFN possesses the strongest robustness and excellent classification performance on the three datasets. Compared with advanced methods, the 3DMHSA-SSFFN displays advanced classification using a limited number of training samples. In particular, the 3DMHSA improves classification stability through the complex calculation of its spatial feature extraction process. Finally, the 3DMHSA-SSFFN achieved the highest classification accuracy using a limited number and limited spatial size of training samples.
The classification accuracy of an HSI model is influenced by the spatial size of the input and the number of training samples. The robustness experiments show that the classification accuracy of 3DMHSA-SSFFN is closely associated with the number of training samples: a larger number of training samples provides more spatial-spectral features for model classification. The advancement experiments show that increasing the spatial size of the input can improve classification accuracy by extracting more spatial features. Therefore, we should ensure that the input spatial size and the number of training samples are the same when comparing the classification performance of different models. Under the same experimental conditions, 3DMHSA-SSFFN exhibits excellent classification performance on all three datasets.
In the proposed 3DMHSA-SSFFN, the multiscale convolution layers and the 3DMHSA extract repeated features, which causes computational redundancy and increased training costs. Therefore, we will consider introducing a lightweight design in future studies, so as to learn spectral-spatial features efficiently and quickly.