Center Attention Network for Hyperspectral Image Classification

Classification is one of the most important research topics in hyperspectral image (HSI) analyses and applications. Although convolutional neural networks (CNNs) have been widely introduced into the study of HSI classification with appreciable performance, the misclassification problem of the pixels on the boundary of adjacent land covers is still significant due to the interfering neighboring pixels whose categories are different from the target pixel. To address this challenge, in this article, we propose a center attention network for HSI classification. The proposed method simultaneously captures spectral-spatial features of the target pixel and its neighboring pixels for classification. Specifically, the method adopts a center attention module (CAM) that pays more attention to the features which are more correlated with the target pixel, that is, the central pixel of the sample, and then sums up the weighted features to generate more relevant and discriminative features. In this way, our method has a high potential for improving the performance of HSI classification. In addition, the CAM greatly reduces the number of parameters in the network via weighted sum of the spectral-spatial features, thus improving the computing efficiency while still maintaining classification accuracy. We evaluate the proposed method on three public datasets, and the experimental results demonstrate the superiority of our method on accuracy and efficiency compared with several state-of-the-art methods.

Fig. 1. Different cases of the relationship between the target pixel and its neighboring pixels in subcube samples of hyperspectral images. "*" (red asterisk) denotes the target pixel, and different colors represent different classes of neighborhood pixels.

[...] earth's surface. Therefore, HSI has gained wide application in diverse domains, such as land scene classification [4], environment monitoring [5], [6], precision agriculture [7], and mineral exploration [8]. Since each HSI pixel can be regarded as a high-dimensional vector, HSI classification, as a significant direction of HSI study, aims to assign each pixel a proper land-cover class label [9]. However, the high dimensionality of HSI and the large quantity of data pose great challenges for traditional methods to achieve ideal classification results.
Recently, deep learning has been recognized as a powerful feature-extraction tool and has shown great advantages in HSI classification [10]-[12]. In terms of whether spatial information is used, deep learning methods for HSI classification fall into spectral-based classification methods and spectral-spatial-based classification methods. The spectral-based methods [13], [14] treat hyperspectral data as a collection of spectral signatures and only use the spectral information when classifying HSIs. As a result, the spatial information of HSI data is ignored so that it is difficult to attain a breakthrough in classification performance. In contrast, the spectral-spatial-based methods [15]-[17] comprehensively integrate the spectral information and spatial information of HSI data. These methods usually take the target pixel and its neighbor pixels as a subcube sample (i.e., a patch) whose class label is that of its central pixel. In addition, Zheng et al. [18] proposed a fast patch-free learning framework which took the whole image as global spatial information. By simultaneously utilizing both the spatial information and spectral information of the subcube samples, the distinguishability of the features is significantly enhanced, thus improving the performance of classification.
Generally, the class labels of all pixels or most of them in a subcube sample are the same, as shown in Fig. 1(a) and (b). However, when the target pixel is located on the boundary of adjacent land covers of different classes, many of its neighboring pixels may actually have different labels, as shown in Fig. 1(c)-(e). In these cases, the classifier may give the target pixel the label to which most pixels in the neighborhood belong rather than its real label, leading to classification mistakes, especially when the neighborhood is large. Furthermore, spectral-spatial features extracted from a subcube sample for HSI classification contain many redundant features. Not all of these features benefit HSI classification, and some of them may heavily interfere with the classification performance. It is difficult to distinguish these unfavorable features because the weights of all the features are equal. These problems bring great challenges to the classification algorithms and affect the further improvement of their performance.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To address these challenges, in this article, we propose a novel method, the center attention network (CAN), for HSI classification. First, the proposed method employs 3D CNN to extract basic spectral-spatial features of the sample. Then, the method adopts a center attention module (CAM) that pays more attention to the features which are more correlated with the target pixel, i.e., the central pixel of the sample, and assigns them different weights according to their correlation levels. The CAM sums up these weighted features to generate new spectral-spatial features and meanwhile reduces the number of the features. Finally, the sample is classified by the classification module with new spectral-spatial features.
To sum up, the major contributions of this article are listed as follows. We propose a novel end-to-end method for HSI classification and are the first to present the CAM. Specifically, the CAM focuses on the features that are more correlated with the target pixel and generates more relevant and discriminative spectral-spatial features. Furthermore, the proposed method considerably reduces the number of parameters in the network by reducing the number of spectral-spatial features, thus improving the computing efficiency while still maintaining classification accuracy. Finally, experiments on three public datasets show the superiority of our method on accuracy compared with several state-of-the-art methods.
The remainder of this article is organized as follows. First, Section II reviews the works related to HSI classification. Then, the proposed method and its rationale are detailed in Section III. Next, Section IV illustrates a series of experiments and results. Finally, conclusions are drawn in Section V.

II. RELATED WORK
CNNs have been widely applied in HSI classification, and attention mechanisms have become increasingly active in this field for effective feature selection. In this section, CNNs and the basic attention mechanism related to HSI classification are reviewed.

A. CNNs for HSI Classification
In recent years, CNNs have been successfully applied in the field of image processing, such as image classification [19], image recognition [20], and image inpainting [21]. CNNs are usually a multilayer network structure. When CNNs are used for classification, they mainly include two parts: a feature-extraction (FE) network and a classification network. The FE network aims to learn high-level representations of the inputs, and the classification network performs the final classification task to assign each input sample with a certain class [3]. There are two main types of approaches for CNNs applied in hyperspectral data classification: spectral analysis and spectral-spatial analysis. The methods based on spectral signatures regard the original spectral vectors or a reasonable number of spectral channels as the input data for HSI classification. In [13], [22], 1D CNNs were employed to capture deep spectral features of pixels for HSI classification. Charmisha et al. [23] proposed a vectorized CNN to perform dimension reduction. Zhan et al. [14] used 1D generative adversarial network to learn the spectral features. 1D CNN only uses the spectral information, but the spatial information is ignored. Actually, spatial information has been reported to be very useful in improving the representation of hyperspectral data and increasing the classification accuracies [2], [24]. Some works have explored 2D CNN for extracting spatial features of pixels. In 2D CNN framework, the spectral features are usually processed by dimension reduction methods. In [25], [26], the authors extracted the first principal component as spectral features and then employed the 2D CNN to extract the spatial features for HSI classification. Song et al. [27] adopted residual learning to extract deep features and fused the features of hierarchical layers to improve the classification accuracy. Zhu et al. [28] proposed a deformable CNN-based method, in which the authors compressed adjacent similar structural information into fixed grids to extract features. In [29], the authors decoupled the feature maps of input patches into multiple response maps and adaptively selected the meaningful maps for classification.
However, the existing spectral-based methods and some 2D spectral-spatial-based methods only use spectral features or capture local spatial features of the pixels. The performance of these methods is restricted as a result of not exploring both spectral and spatial features simultaneously. Recently, 3D CNNs, which can extract the spectral-spatial features of HSI concurrently, have attracted the interest of many researchers. Ying et al. [15] directly employed 3D CNN to extract deep spectral-spatial features for HSI classification. Chen et al. [26] used 3D CNN with regularization to obtain spectral-spatial features for HSI classification. Zhong et al. [30] designed a 3D spectral and spatial residual block which can consecutively learn the deep spectral-spatial features. Mei et al. [31] used a 3D convolutional autoencoder to learn spectral-spatial features without supervision. HSIs are data cubes in which spectral and spatial information coexist, and 3D CNN filters are a natural method for discovering the spectral-spatial features within such images. To explore the spectral-spatial features as a whole, our method employs 3D CNN for extracting basic spectral-spatial features. The existing methods directly use the basic features or select key features from them for HSI classification. In contrast, our method focuses on the features that are more relevant to the target pixel, assigns them different weights according to their correlation levels through the CAM, and then sums them up to generate new spectral-spatial features with more discriminative characteristics.

B. Attention Mechanism
As a research hotspot in computer vision, attention mechanism has been widely used in various fields of deep learning, such as machine translation [32], object recognition [33], pose estimation [34], saliency detection [35], and scene segmentation [36]. Fu et al. [36] adopted the position attention module and channel attention module to learn the spatial and channel information separately. Hu et al. [37] employed squeeze and excitation operations to assign different weights to different channels for selecting important feature maps.
Attention mechanism is a method that simulates human visual perception. When a person observes an object, the vision quickly scans the global image, focuses on the key area, and suppresses other useless information and background information. Attention mechanism in computer vision is similar to that in human vision. Its purpose is to focus on some important features of the target and select more critical features from a large number of features. Fig. 2 shows the process of the basic attention mechanism.
As shown in Fig. 2, under the basic attention mechanism, the output feature is the weighted sum of the input features according to their importance. The formula is as follows:

$$h_{out} = \sum_{i=1}^{t} \alpha_i h_i$$

where $h_{out}$ is the output feature, $h_1, \ldots, h_t$ are the input features, $\alpha_1, \ldots, \alpha_t$ are the corresponding weights, and $t$ is the number of input features. $\alpha_i$ is obtained by a softmax function. It is defined as

$$\alpha_i = \frac{\exp(F(h_i))}{\sum_{j=1}^{t} \exp(F(h_j))}$$

where $F(\cdot)$ denotes the scoring function and $\exp(\cdot)$ denotes the exponential function.
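The weighted-sum step described above can be illustrated with a minimal NumPy sketch. The function name `basic_attention` and the choice to pass precomputed scores (rather than a learned scoring function F) are assumptions made for illustration only.

```python
import numpy as np

def basic_attention(features, scores):
    """Weighted sum of input features with softmax-normalized weights.

    features: array of shape (t, m), one row per input feature h_i.
    scores:   array of shape (t,), the raw scores F(h_i) (assumed precomputed).
    """
    # Subtracting the maximum does not change the softmax result
    # but avoids overflow in exp().
    shifted = scores - np.max(scores)
    alpha = np.exp(shifted) / np.sum(np.exp(shifted))
    # h_out = sum_i alpha_i * h_i
    h_out = alpha @ features
    return h_out, alpha

features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([0.0, 0.0, 0.0])  # equal scores give uniform weights
h_out, alpha = basic_attention(features, scores)
```

With equal scores, every feature receives the same weight and the output reduces to the plain average of the inputs.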
Recently, the attention mechanism has shown great potential in the field of remote sensing. Some researchers have introduced it into HSI classification. Fang et al. [38] exploited 3D dilated convolutions to capture the spectral-spatial features, and then adopted spectralwise attention to enhance the distinguishability of spectral features. Ma et al. [39] applied two types of attention mechanism in two branches to extract spectral and spatial features and then concatenated them for classification. The work in [16] applied the spectral attention Bi-RNN branch for spectral features and applied the spatial attention CNN branch for spatial features. Sun et al. [9] embedded the attention module after both the spectral module and spatial module to suppress the impact of interfering pixels. In these earlier works, the attention mechanism was either applied independently to spectral and spatial features, with the outputs merged afterward, or used sequentially after the spectral and spatial modules to select key features. In our method, the CAM is exploited to seek the desired spectral-spatial features that are more correlated with the target pixel and assign them different weights. Then, the CAM sums up the weighted features to get more discriminative features, which not only introduces a target-focused strategy but also reduces the number of parameters.

III. PROPOSED METHOD
In this section, we describe the proposed CAN in detail. CAN contains three parts: a 3D CNN module, a CAM, and a classification module. The 3D CNN module is used to capture the basic spectral-spatial features of the target pixel and its adjacent pixels; the CAM aims to fuse these features and generate more discriminative features; the target pixel is classified by the classification module with a softmax function. Fig. 3 illustrates the architecture of our proposed CAN.

A. 3D CNN Module for Spectral-Spatial Features
An HSI is represented in a 3D cube. In the proposed method, to explore both spectral and spatial information simultaneously, the 3D CNN module is employed as a feature extractor, consisting of convolution layers, batch normalization layers, nonlinearity layers, and pooling layers.

1) 3D Convolution Layer:
The convolution layer is a layer where each neuron computes the dot product between its weights and a small region of the input volume matched to it. The layer's goal is to identify certain features from the previous layer and transform them to feature maps. It is formulated as follows [17]:

$$O = W \otimes I + b$$

where $I \in \mathbb{R}^{d \times r \times l}$ is the input volume, $O \in \mathbb{R}^{d' \times r' \times l'}$ is the output volume, $W$ is the filter (neuron or kernel) with the size $k_1 \times k_2 \times k_3$, $b$ is the bias, $d \times r \times l$ represents the size of the input volume, and $d' \times r' \times l'$ represents the size of the output volume. $\otimes$ denotes the convolution operation. Fig. 4 shows the process of 3D convolution. Generally, multiple 3D convolution filters are stacked in one layer to explore different kinds of spectral-spatial features. The 3D convolutional layer can produce many spectral-spatial feature maps. When 3D convolutional layers are connected sequentially, more abstract spectral-spatial features are extracted.
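As a rough illustration of the operation above, the following NumPy sketch applies a single 3D filter to an input volume. It assumes stride 1, no padding, and the sliding-window (cross-correlation) form used by most deep learning libraries; none of these settings are stated in the text, and the function name `conv3d_valid` is hypothetical.

```python
import numpy as np

def conv3d_valid(volume, kernel, bias=0.0):
    """Naive single-filter 3D convolution (assumed: stride 1, no padding)."""
    d, r, l = volume.shape
    k1, k2, k3 = kernel.shape
    # Under these assumptions the output size is (d-k1+1, r-k2+1, l-k3+1).
    out = np.empty((d - k1 + 1, r - k2 + 1, l - k3 + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # Dot product between the filter and the matching input region.
                region = volume[x:x + k1, y:y + k2, z:z + k3]
                out[x, y, z] = np.sum(region * kernel) + bias
    return out

volume = np.ones((7, 7, 10))   # toy subcube: 7 x 7 pixels, 10 bands
kernel = np.ones((3, 3, 3))
out = conv3d_valid(volume, kernel, bias=1.0)
```

A practical implementation would stack many such filters per layer and rely on an optimized library routine instead of explicit loops.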
2) Batch Normalization Layer: This layer is often used to improve numerical stability. Batch normalization [40] is represented as

$$y = \gamma \cdot \frac{x - \mathrm{mean}[x]}{\sqrt{\mathrm{var}[x] + \epsilon}} + \beta$$

where $x$ is a minibatch of inputs, $y$ is the normalized output, $\mathrm{mean}[x]$ and $\mathrm{var}[x]$ represent the mean and variance of $x$, which are calculated over a minibatch, $\gamma$ and $\beta$ are the learnable parameters, and $\epsilon$ is a very small constant value.
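A minimal sketch of the training-time computation follows. It assumes statistics are taken over the batch axis (axis 0) and omits the running averages used at inference; the function name and the default `eps` value are illustrative choices.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then scale and shift."""
    mean = x.mean(axis=0)          # per-feature mean over the minibatch
    var = x.var(axis=0)            # per-feature variance over the minibatch
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With gamma set to 1 and beta to 0, each output feature has approximately zero mean and unit variance regardless of the input scale.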

3) Nonlinearity Layer:
This layer is applied to learn the nonlinear relationship contained in the previous volume by leveraging a nonlinear function. In this article, we adopt the rectified linear unit (ReLU) [41] as the nonlinear function. It is defined as

$$\mathrm{ReLU}(x) = \max(0, x)$$

4) Pooling Layer:
This layer is often used to summarize the features and reduce the feature dimensions through a pooling function. In our proposed method, 3D max pooling is applied to extract spectral-spatial features after the nonlinear layer. The 3D max pooling operation takes the maximum value within a small spatial region of the input volumes, and it is defined as

$$O_{p,q,z} = \max_{\delta_p, \delta_q, \delta_z} I_{p+\delta_p,\, q+\delta_q,\, z+\delta_z}$$

where $I_{p+\delta_p, q+\delta_q, z+\delta_z}$ represents the input values at position $(p, q, z)$ with a region of size $(\delta_p, \delta_q, \delta_z)$ and $O_{p,q,z}$ represents the output value at position $(p, q, z)$ after 3D max pooling.
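The pooling operation can be sketched in NumPy as below. The sketch assumes non-overlapping windows (stride equal to the window size) and drops trailing elements that do not fill a complete window; the text does not specify either choice, and `max_pool3d` is a hypothetical name.

```python
import numpy as np

def max_pool3d(volume, size):
    """Non-overlapping 3D max pooling over windows of the given size."""
    sp, sq, sz = size
    P = volume.shape[0] // sp
    Q = volume.shape[1] // sq
    Z = volume.shape[2] // sz
    # Drop trailing elements that do not fill a complete window.
    trimmed = volume[:P * sp, :Q * sq, :Z * sz]
    # Split every axis into (window index, offset inside window).
    blocks = trimmed.reshape(P, sp, Q, sq, Z, sz)
    # Take the maximum over the three offset axes.
    return blocks.max(axis=(1, 3, 5))

volume = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
pooled = max_pool3d(volume, (2, 2, 2))
```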

B. Center Attention Module
The basic attention mechanism automatically selects the key features and ignores trivial features. The key features selected usually represent the major information of the samples. However, when the samples contain considerable disturbing information, the key features selected may not correctly represent the salient information of the target, thus leading to classification mistakes. Taking HSI classification as an example, when the target pixel is on the edge of two or more classes of land covers, as shown in Fig. 1(c)-(e), there will be many interfering pixels around it. The interfering pixels often have different labels from the target pixel. Sometimes, they occupy the majority in the neighborhood of the target pixel. Under these cases, the key features selected by the basic attention may not correctly represent the target pixel, leading to classification mistakes. How to accurately extract and choose better features representing the target pixel becomes the core of improving classification accuracy.
To address this problem, we propose a novel CAM to seek the desired spectral-spatial features which are more discriminative for the classification. The CAM focuses more on the features that are highly correlated with the central pixel (i.e., the target pixel) in the subcube sample. Since the convolution filters scan the samples sequentially, the central filtered features can often better represent the central (target) pixel. So we calculate the correlation scores between all the features and the central features to evaluate the contribution of different features for the classification of the target pixel. Then, the CAM exerts unequal weights on these features according to their correlation scores. The stronger the correlation, the greater the weight, and vice versa. Finally, it sums up these features of different weights to reduce the number of features and generate new spectral-spatial features. These new features are more discriminative for classification in that they are more relevant to the target pixel. The details of the CAM are shown in Fig. 5.
As illustrated in Fig. 5, first, we take an output of the prior convolution block as the input feature map $H \in \mathbb{R}^{s \times s \times m}$ for the CAM. Then, we apply three convolution layers with a kernel size of (1 × 1 × 1) [42], [43] and three ReLU layers to $H$ separately, generating three different new feature maps $H_1$, $H_2$, and $H_3$. The (1 × 1 × 1) convolution layers and ReLU layers are used to enhance the nonlinear representation of the features. $H_1$ and $H_2$ are used to calculate the attention vector that softly weights the importance of different features, whereas $H_3$ is a simple nonlinear transformation of the input feature map with suitable dimensions. To enable matrix operations, $H_1$ and $H_3$ are reshaped into $U_1$ and $U_3$, where $U_1, U_3 \in \mathbb{R}^{ss \times m}$ and $ss = s \times s$. The center feature vector $h_{center} \in \mathbb{R}^{1 \times m}$ is extracted from the center of $H_2$. Both $U_1$ and $h_{center}$ are fed into the scoring function $F(\cdot)$ to calculate the correlation scores between them, and then a softmax function is applied to the correlation scores to calculate the attention vector $\alpha$. Finally, the output feature vector $h_{out} \in \mathbb{R}^{1 \times m}$ of the CAM is obtained by multiplying the matrices $\alpha$ and $U_3$. It is formulated as

$$h_{out} = \alpha U_3$$

$$\alpha_i = \frac{\exp(F(h_i, h_{center}))}{\sum_{j=1}^{ss} \exp(F(h_j, h_{center}))}$$

where $\alpha \equiv [\alpha_1, \ldots, \alpha_{ss}]$, $U_1 \equiv [h_1, \ldots, h_{ss}]^T$, $\alpha_i$ is obtained by the softmax function, $h_{center}$ is the center feature vector, and $\exp(\cdot)$ denotes the exponential function. $F(\cdot)$ denotes the scoring function, which is implemented by a fully connected layer parameterized by a weight matrix $W \in \mathbb{R}^{ss \times ss}$. The $g_i$ are used to calculate the correlation between $h_i$ and $h_{center}$, and $m$ is the length of $h_{center}$. The correlation scores are obtained by multiplying all the $g_i$ with $W$ and activating the results by a nonlinear function $\delta(\cdot)$, i.e., ReLU. They are formulated as

$$[F(h_1, h_{center}), \ldots, F(h_{ss}, h_{center})]^T = \delta(W g), \quad g = [g_1, \ldots, g_{ss}]^T$$
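The data flow of the CAM can be sketched in NumPy as follows. This is an illustrative forward pass under several assumptions that are not confirmed by the text: the 1 × 1 × 1 convolutions are modeled as per-position matrix products without biases, the raw correlation g_i is taken to be a dot product scaled by the square root of m, and all weights are random placeholders rather than trained values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def center_attention(H, W1, W2, W3, W_score):
    """Forward pass of a center-attention step on one feature map.

    H:        input feature map of shape (s, s, m), with s odd.
    W1..W3:   (m, m) matrices standing in for the 1x1x1 convolutions.
    W_score:  (s*s, s*s) weight matrix of the fully connected scoring layer.
    """
    s, _, m = H.shape
    # A 1x1x1 convolution acts on each position independently, so it is
    # modeled here as a per-position matrix product followed by ReLU.
    H1, H2, H3 = relu(H @ W1), relu(H @ W2), relu(H @ W3)
    U1 = H1.reshape(s * s, m)
    U3 = H3.reshape(s * s, m)
    h_center = H2[s // 2, s // 2]        # center feature vector, shape (m,)
    # ASSUMPTION: scaled dot product as the raw correlation g_i.
    g = U1 @ h_center / np.sqrt(m)       # shape (s*s,)
    scores = relu(W_score @ g)           # delta(W g)
    alpha = softmax(scores)              # attention vector, sums to 1
    h_out = alpha @ U3                   # weighted sum, shape (m,)
    return h_out, alpha

rng = np.random.default_rng(0)
s, m = 5, 8
H = rng.normal(size=(s, s, m))
W1, W2, W3 = (rng.normal(size=(m, m)) for _ in range(3))
W_score = rng.normal(size=(s * s, s * s))
h_out, alpha = center_attention(H, W1, W2, W3, W_score)
```

Whatever the exact form of g_i, the key property is visible in the shapes: an s × s × m feature map is reduced to a single m-dimensional vector.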

C. Center Attention Network
CAN is an end-to-end, patch-based method for HSI classification. It takes the target pixel and its neighbor pixels together as a subcube sample (i.e., patch) whose class label is that of its central pixel. The method mainly contains three parts: the 3D CNN module, the CAM, and the classification module. Fig. 3 portrays the architecture of the proposed method.
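Cropping a subcube sample around a target pixel might look like the sketch below. The reflect padding at image borders is an assumption, since the text does not say how border pixels are handled, and `extract_patch` is a hypothetical helper name.

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Return the size x size x bands subcube centered on pixel (row, col).

    ASSUMPTION: the image is reflect-padded so that border pixels also
    receive full-size patches.
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # Pixel (row, col) sits at (row + half, col + half) in the padded image,
    # so the window starting at (row, col) is centered on it.
    return padded[row:row + size, col:col + size, :]

image = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)
patch = extract_patch(image, row=2, col=3, size=5)
corner = extract_patch(image, row=0, col=0, size=5)
```

The label of each patch is the label of its central pixel, here `image[2, 3]` and `image[0, 0]`.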
First, many subcube samples are cropped from the dataset. Next, basic spectral-spatial features are extracted by the 3D CNN module, which is built with two sequential 3D convolution blocks. Each block consists of a convolutional layer, a batch normalization layer, a ReLU layer, and a max pooling layer. Then, the CAM assigns different weights to these spectral-spatial features according to their relevance to the target pixel and sums up the weighted features to generate more discriminative features. Its detailed process is shown in Fig. 5. Finally, these new features are fed into the classification module. The label values are determined by a classifier with a softmax function. The classifier is composed of fully connected layers, a batch normalization layer, a ReLU layer, and a softmax layer. The categorical cross entropy is employed as the loss function, defined as

$$L = -\sum_{i=1}^{c} y_i \log(p_i)$$

where $c$ is the number of land-cover classes, $p_i$ is the output of the CAN for the $i$th class, $y_i$ is the label value, and $y_i \in \{0, 1\}$ ($y_i = 1$ if the sample belongs to the $i$th class, and $y_i = 0$ otherwise). For its robustness in learning, the Adam [44] optimizer is adopted. Table I specifies the detailed parameters of each layer in the proposed method. In Table I, $c$ is the number of land-cover classes.
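For a single sample, the categorical cross entropy can be computed as in this small sketch. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero; it is not part of the definition in the text.

```python
import numpy as np

def categorical_cross_entropy(p, y, eps=1e-12):
    """Cross entropy between predicted probabilities p and one-hot label y."""
    p = np.clip(p, eps, 1.0)       # guard against log(0)
    return -np.sum(y * np.log(p))

p = np.array([0.7, 0.2, 0.1])      # softmax output for c = 3 classes
y = np.array([1.0, 0.0, 0.0])      # the sample belongs to the first class
loss = categorical_cross_entropy(p, y)
```

Because the label is one-hot, the loss reduces to the negative log-probability assigned to the true class.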

IV. EXPERIMENTS
To evaluate the effectiveness of our proposed method for HSI classification, we conducted a series of experiments on three public datasets. Experimental results demonstrate that the proposed method achieved better results compared with several state-of-the-art methods.

A. Datasets
The datasets used in the experiments were Indian Pines (IP), University of Pavia (UP), and Salinas Valley (SV), which are widely used in the validation of HSI classification methods. Next, we introduce these datasets in detail.
1) Indian Pines: The IP dataset was collected in northwest Indiana by the AVIRIS sensor in 1992. It includes 220 spectral bands from wavelengths of 400-2500 nm with an interval of 10 nm. There are 200 usable bands left after the removal of the water absorption and null bands. The size of the image is 145 × 145, and the spatial resolution is 20 m. In this dataset, 16 different land-cover categories are included, with a total of 10 249 labeled pixels. Fig. 6 shows the pseudocolor image and the ground-truth map of the IP dataset.
2) University of Pavia: The UP dataset was acquired through the ROSIS sensor in 2003. It includes 115 bands. After the noise  bands were removed, 103 available bands remained. The wavelength range is from 380 to 860 nm, and the spatial resolution is 1.3 m. The size of the image is 610 × 340. There are nine types of land cover and a total of 42 776 labeled samples in the UP dataset. Fig. 7 shows the pseudocolor map and the ground-truth map of the UP dataset.
3) Salinas Valley: The Salinas Valley dataset was acquired by the AVIRIS sensor in 1998. It includes 224 bands, with a wavelength range from 400 to 2500 nm. After removing the water absorption and noise bands, 204 bands remained. The size of the image is 512 × 217, and its ground resolution is 3.7 m. In this dataset, there are 16 types of land cover and a total of 54 129 labeled samples. Fig. 8 shows the pseudocolor map and the ground-truth map of the SV dataset.

B. Experimental Settings and Measures
In this section, data preprocessing and data augmentation methods are introduced. In data preprocessing, we normalize the data with maximum and minimum values. Then, the average value of the corresponding band is subtracted from the normalized data.
In data augmentation, we flip and rotate the subcubes to alleviate the overfitting problem due to insufficient labeled samples. First, a subcube sample is flipped horizontally and vertically; second, the subcube sample is rotated 90, 180, and 270 degrees around the central pixel. After these operations, each subcube sample generates five additional samples. In addition, batch normalization adopted in the proposed method can also relieve the overfitting problem.
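The five additional samples per subcube can be generated as in the following sketch, which assumes the spatial axes are the first two axes of the array and that the spatial size is odd, so the central pixel is preserved by every transform.

```python
import numpy as np

def augment(patch):
    """Return the five additional samples: two flips and three rotations.

    patch: array of shape (s, s, bands); axes 0 and 1 are spatial.
    """
    return [
        np.flip(patch, axis=1),              # horizontal flip
        np.flip(patch, axis=0),              # vertical flip
        np.rot90(patch, k=1, axes=(0, 1)),   # rotate 90 degrees
        np.rot90(patch, k=2, axes=(0, 1)),   # rotate 180 degrees
        np.rot90(patch, k=3, axes=(0, 1)),   # rotate 270 degrees
    ]

patch = np.arange(3 * 3 * 2).reshape(3, 3, 2)
extra = augment(patch)
```

Since the label of a subcube is that of its central pixel, all five augmented samples keep the original label.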
To quantitatively analyze the performance of the algorithm, we use the overall accuracy (OA), average accuracy (AA), and kappa as evaluation measures. OA refers to the proportion of all correctly classified samples in the test samples; AA refers to the average classification accuracy of different categories; and kappa measures the consistency between classification results and ground truth. The larger the value of OA, AA, and kappa, the better the results.
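The three measures can be computed from a confusion matrix, as in the sketch below. It uses the standard definition of Cohen's kappa and assumes every class occurs at least once in the test labels; the function name is illustrative.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return overall accuracy, average accuracy, and Cohen's kappa."""
    # Confusion matrix: rows are true classes, columns are predictions.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    n = cm.sum()
    oa = np.trace(cm) / n
    # Per-class accuracy; assumes each class has at least one test sample.
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()
    # Expected agreement by chance, from the row and column marginals.
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])
oa, aa, kappa = classification_metrics(y_true, y_pred, num_classes=2)
```

In this toy case OA is 5/6 while AA is 0.875, which shows how the two measures diverge when class sizes differ.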

C. Comparing With Other Methods
To verify the performance of the proposed CAN method, we compare it with several state-of-the-art methods, including 1D CNN, 2D CNN [26], SMBN (squeeze multibias network) [29], DFFN (deep feature fusion network) [27], DHCNet (deformable HSI classification networks) [28], SSRN (spectral-spatial residual network) [30], and SSAN (spectral-spatial attention networks) [9], all of which are based on deep learning with CNN modules. To make a fair comparison, the proposed method and the comparison methods use the same experimental settings, including data preprocessing and data augmentation. The detailed parameters are set as follows. The spatial size of the HSI subcube of all methods is set to 7 × 7. The number of training epochs is set to 200, except for 1D CNN, which is trained for 1000 epochs because it is trained without data augmentation. The batch size is 100. The weight parameters of each method are optimized by Adam [44]. The learning rates of the competing methods are the same as those in the original papers. The learning rate of the proposed method is 0.001. These experiments are conducted on the IP, UP, and SV datasets. On the IP dataset, we randomly select 10% of the labeled samples in each land-cover category as training samples, and the rest are test samples. On the UP and SV datasets, 2% of the labeled samples are randomly selected as training samples, and the rest of the labeled samples are test samples. The numbers of training and test samples belonging to different categories on the IP, UP, and SV datasets are shown in Tables II-IV. Table V shows the classification results of different methods on the IP dataset, Table VI on the UP dataset, and Table VII on the SV dataset [16]. We highlight the best results in italic.

TABLE II. NUMBER OF TRAINING SAMPLES, TESTING SAMPLES, AND TOTAL SAMPLES ON THE INDIAN PINES DATASET

TABLE III. NUMBER OF TRAINING SAMPLES, TESTING SAMPLES, AND TOTAL SAMPLES ON THE UNIVERSITY OF PAVIA DATASET
From the results in Tables V-VII, it is obvious that the methods based on spectral-spatial features show superior performance over the method based on only spectral features (1D CNN). This demonstrates that the spatial information is helpful for improving classification performance. We also find that the SSAN method and the proposed method outperform other methods based on spectral-spatial features because the two methods manage to select the required features and suppress unwanted ones. This indicates that some features extracted by CNN are redundant, and many of them are useless or even counterproductive. This also affirms that the CAM in our method is necessary for selecting and summing up these basic spectral-spatial features.

TABLE IV. NUMBER OF TRAINING SAMPLES, TESTING SAMPLES, AND TOTAL SAMPLES ON THE SALINAS VALLEY DATASET
From these classification results, we further discover that the classification accuracies vary greatly among different categories because the number of samples belonging to different classes is unequal, resulting in an imbalance among the training samples, especially on the IP dataset. The category with the fewest training samples is "Oats," which has only 2 samples, whereas the "Soybean-m" category has 246 samples. This imbalance in the number of training samples poses a major challenge to classification methods. In terms of AA, the proposed method is better than the comparative methods when the dataset is unbalanced.
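The per-category random selection of training samples, and the imbalance it produces, can be sketched as follows. The rounding rule and the toy class sizes are assumptions for illustration; the actual per-class counts are those listed in Tables II-IV.

```python
import numpy as np

def stratified_split(labels, ratio, seed=0):
    """Randomly pick a fixed ratio of samples from every class for training.

    ASSUMPTION: the per-class count is round(ratio * class size), so very
    small classes may receive no training samples at low ratios.
    """
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_train = int(round(ratio * idx.size))
        train_idx.extend(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.array(train_idx, dtype=np.int64))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# Toy labels: a small class of 20 samples and a large class of 200 samples.
labels = np.array([0] * 20 + [1] * 200)
train_idx, test_idx = stratified_split(labels, ratio=0.10)
small_idx, _ = stratified_split(labels, ratio=0.02)
```

At a 10% ratio the small class contributes only 2 training samples, and at 2% it contributes none, which is the kind of imbalance discussed above.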

D. Impact of Spatial Size
The spatial size of the subcube has an important impact on the classification results [3]. In this section, we conduct several experiments on the IP, UP, and SV datasets to explore the impact of size on the classification results. The ratios of labeled samples on the IP, UP, and SV datasets are 10%, 2%, and 2%, respectively. The spatial sizes of the subcube are set to 5 × 5, 7 × 7, 9 × 9, 11 × 11, and 13 × 13. The number of training epochs is 100, and the batch size is 100. All the other parameters retain the settings of the previous experiments. Fig. 9 shows the OAs of the proposed method on the three datasets with different spatial sizes.
From Fig. 9, we find that the classification performance gradually improves as the spatial size expands. The reason is that a larger sample may contain more spatial information. However, the effect of the increased spatial size on the classification performance is different on the IP, UP, and SV datasets. When the size is 13 × 13, the performance on the IP dataset begins to decline slightly. This is because as the size increases, there will be more interfering pixels in subcube samples, which may affect the classification performance of the method. Therefore, a suitable spatial size is very important. In the following experiments, the spatial size is set to 11 × 11.

E. Effectiveness of CAM
The CAM plays an essential role in the proposed method. It effectively fuses the spectral-spatial features, which greatly reduces the number of parameters and improves the training efficiency of the method. Under the same configurations, the number of parameters of our method with and without CAM is shown in Fig. 10, and their training times are shown in Table VIII. The results in Fig. 10 suggest that the number of parameters of the method with CAM is much smaller than that of the method without CAM. According to Table VIII, the time spent by the method with CAM is less than that of the method without CAM. The reduction in parameter quantity and the improvement in training efficiency benefit from the weighted sum of spectral-spatial features by the CAM.
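The source of the parameter saving can be illustrated with a back-of-the-envelope count for the first fully connected layer of the classifier. The sizes below are hypothetical and are not taken from Table I; they only show how the count scales when s × s feature vectors are replaced by a single weighted-sum vector.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical sizes, chosen only for illustration (not from Table I).
s, m, hidden = 5, 64, 128

# Without CAM: the whole s x s x m feature map is flattened.
without_cam = dense_params(s * s * m, hidden)
# With CAM: only the m-dimensional weighted-sum vector is passed on.
with_cam = dense_params(m, hidden)
```

In this toy setting the layer shrinks from 204 928 to 8320 parameters. The CAM adds parameters of its own (the three 1 × 1 × 1 convolutions and the scoring matrix W), so the net saving depends on the actual layer sizes.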
The CAM helps generate more relevant and discriminative spectral-spatial features and improves the classification performance. To verify the effectiveness of the CAM, the experiments are conducted on the IP, UP, and SV datasets with and without CAM for comparison. In these experiments, the spatial size is 11 × 11, the number of training epochs is 100, the batch size is 100, and the optimizer is Adam.
To visually display and verify the classification results, Figs. 11-13 portray the classification results of the method with and without CAM on the IP, UP, and SV datasets. In these figures, "*" (red asterisk) represents the misclassified labeled samples, and others are correctly identified labeled samples (the black area is the background pixels).
From Figs. 11 to 13, it can be seen that the CAM is effective and helpful for improving the classification performance, especially at the boundary of different classes. There are many [...] 11-13(b), especially on the boundary. The main reasons are as follows: 1) the proposed method finds the internal correlation between the target pixel and its neighboring pixels, and the pixels with higher correlation are more contributive to the classification; 2) through the CAM, the proposed method effectively fuses the spectral-spatial features and generates more relevant and discriminative features.

F. Impact of Training Ratios
In actual applications, the number of training samples is an important factor for classification accuracy. In this section, we explore the performance of the proposed method with different ratios of labeled samples. The ratios of labeled samples are set to 2%, 5%, 10%, 15%, and 20%. The result for 2% on the IP dataset is not reported because some categories contain so few samples that a 2% ratio yields no training samples for them. Table X shows the classification performance of the proposed method on the three datasets with different percentages of labeled samples as training samples.
In Table X, we can observe that the classification accuracies improve as the ratios of labeled samples increase. This indicates that the spectral-spatial features learned by the proposed method are effective for HSI classification. We also find that when only a small number of samples (5% for the IP dataset and 2% for the UP and SV datasets) are available, satisfactory results can still be obtained by the proposed method. Achieving good results with fewer training samples is crucial for HSI classification since the labeled samples are often difficult to collect in actual situations.

V. CONCLUSION
In this article, we propose an end-to-end hyperspectral image classification method that introduces a CAM into a 3D CNN to enhance the classification accuracy of hyperspectral images. Specifically, this method effectively learns the internal correlation between the central pixel and its neighboring pixels in a subcube sample and generates more discriminative spectral-spatial features. Experimental results demonstrate that our method outperforms several state-of-the-art HSI classification methods based on deep learning, and it remains effective even with a limited number of labeled samples. In addition, the method effectively fuses the basic spectral-spatial features extracted by the 3D CNN module, which significantly reduces the number of parameters and improves the training efficiency.