Pyramidal Multiscale Convolutional Network With Polarized Self-Attention for Pixel-Wise Hyperspectral Image Classification

In recent years, pixel-wise hyperspectral image (HSI) classification has received growing attention in the field of remote sensing. Plenty of spectral–spatial convolutional neural network (CNN) methods with diverse attention mechanisms have been proposed for HSI classification due to the attention mechanisms being able to provide more flexibility over standard convolutional blocks. However, it remains a challenge to effectively extract multiscale features of high-resolution HSI in a real-world complex environment. In this article, we propose a pyramidal multiscale spectral–spatial convolutional network with polarized self-attention for pixel-wise HSI classification. It contains three stages: channel-wise feature extraction network, spatial-wise feature extraction network, and classification network, which are used to extract spectral features, extract spatial features, and generate classification results, respectively. Pyramidal convolutional blocks and polarized attention blocks are combined to extract spectral and spatial features of HSI. Furthermore, residual aggregation and one-shot aggregation are employed to better converge the network. The experimental results on several public HSI datasets demonstrate that the proposed network outperforms other related methods.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) are acquired by remote sensors and contain hundreds of contiguous, narrow spectral bands ranging from the visible to the short-wave infrared. HSI can effectively characterize land cover objects of interest [1] and has been widely used in many research fields, such as urban planning [2], environmental monitoring [3], fine agriculture [4], mineral exploration [5], [6], and military applications [7]. With the rapid development of remote sensing and hyperspectral imaging technology, it has become easier to acquire HSI datasets. However, the analysis and processing of HSI datasets remain insufficient [8].
The pixel-wise classification of HSI, an important problem in HSI processing, has attracted considerable interest and has been studied by many scholars in recent years [9], [10]. The purpose of pixel-wise classification is to assign a unique category label to each pixel of the HSI dataset. Traditional machine learning approaches to HSI classification use handcrafted features, such as local binary patterns (LBPs) [11], histogram of oriented gradients (HOG) [12], and global image scale-invariant (GIST) [13], to train classifiers, such as K-nearest neighbors (KNN) [14], extreme learning machine (ELM) [15], and support vector machine (SVM) [16]. Although these handcrafted features can effectively represent various shallow attributes of HSI, the robustness and discriminability of the methods are difficult to maintain in complex real-world remote sensing environments. Furthermore, parameter setting and the need for domain knowledge also limit the usage of handcrafted features in HSI classification tasks. In contrast, deep learning methods can automatically learn both shallow features and deep semantic information from an HSI dataset in a hierarchical manner, and have shown great potential for feature representation in HSI classification tasks [17], [18].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, many deep learning-based frameworks have been proposed, such as recurrent neural networks (RNNs) [19], convolutional neural networks (CNNs) [20], graph convolutional neural networks (GCNNs) [21], and generative adversarial neural networks (GANNs) [22]. Among these frameworks, the CNN framework, which has been widely used in RGB image processing, is applied to pixel-wise HSI classification for its excellent performance. CNN employs spatial weight sharing of the convolutional kernel to reduce the computational complexity and uses activation functions to add nonlinearities to the network. According to the extracted features, the CNN-based frameworks can be divided into three types: spectral CNN, spatial CNN, and spectral-spatial CNN [23]. The spectral CNNs take advantage of the abundant spectral signature of HSI and exploit the spectral features (1-D vector) to improve the classification accuracy. For example, Hu et al. [24] proposed deep CNNs to classify HSIs directly in the spectral domain. Five layers are implemented on each spectral signature to discriminate it from the others. The experimental results show that the proposed method can obtain better accuracy than some traditional methods. In [25], each 1-D spectral vector of a pixel is transformed into a 2-D spectral feature matrix to overcome the strong correlation among bands for HSI classification. The 1 × 1 and 3 × 3 convolutional layers are implemented in the CNN framework to better deal with HSI information and accomplish feature reuse. Jin et al. [26] propose a deep neural network classification model for the pixels of wheat HSI to accurately discern the disease areas. In this model, the pixel spectra data are reshaped into a 2-D data structure. 
A hybrid network framework with a convolutional layer and bidirectional recurrent layer is reconstructed to improve the generalization of the model. In [27], comparisons are conducted among KNN, SVM, and CNN models in the spectral dimension of HSIs over four rice seed varieties. The result shows that the CNN model performs better than the corresponding KNN and SVM in most cases. Although spectral CNNs achieve better results than traditional classification methods, the CNNs are constrained to extract the spectral signatures of HSI, while the spatial information is insufficiently utilized. In contrast, spatial CNN models employ a spatial map (2-D matrix) as the input data to extract the spatial information from the HSI dataset. For example, Li et al. [28] use principal component analysis (PCA) to extract the first principal component (PC) with refined spatial information and propose a full CNN with convolution, deconvolution, and pooling layers to enhance the deep features. After the feature enhancement, the optimized ELM is utilized for classification. Xu et al. [29] propose a random patches network for HSI classification, which uses 2-D convolutional kernels in the CNN framework. In [30], Gabor filters are employed to combine with the 2-D convolutional filters for HSI classification to mitigate the problem of overfitting. The classification results show that the proposed model provides competitive results. In [31], a spatial CNN framework is proposed for HSI classification embedded with an extracted hashing feature. The proposed CNN achieves a powerful distinguishing ability from different classes. Although spatial CNNs can effectively extract the spatial information of HSI pixels to improve the classification accuracy, CNNs inevitably lose a large amount of spectral information. To avoid this problem, spectral-spatial CNN is naturally implemented for pixel-wise HSI classification, which can jointly extract spectral and spatial information from the HSI dataset. 
The input data of the spectral-spatial CNN is a 3-D cube, usually a square HSI data cube cropped around and centered on the corresponding pixel. The spectral-spatial CNN has greatly improved the classification accuracy and is the dominant line of research in HSI classification. For example, Li et al. [32] proposed a 3-D CNN framework to extract the deep spectral-spatial combined features of the HSI dataset. The experimental results show that the proposed 3-D CNN method outperforms the stacked autoencoder, deep belief network, and 2-D CNN network. Zhong et al. [33] design an end-to-end spectral-spatial residual network (SSRN) that takes raw 3-D cubes as input data for HSI classification. The residual blocks of the network consecutively learn discriminative features from spectral signatures and spatial contexts in HSI. The experimental results show that the proposed network achieves competitive HSI classification accuracy in agricultural, rural-urban, and urban datasets. Zhang et al. [34] propose a 3-D lightweight CNN for limited-samples-based HSI classification. Two learning strategies, the cross-sensor strategy and the cross-modal strategy, are proposed to further alleviate the small sample problem. Experiments demonstrate that the proposed network achieves competitive performance for HSI classification. Roy et al. [35] propose a bilinear fusion mechanism for HSI classification. The excitation operation is performed using the fused output of the squeeze operation. The experimental results confirm the superiority of the proposed method. Jia et al. [36] also propose a lightweight CNN. The spatial-spectral Schrodinger eigenmaps feature extraction is first adopted to obtain the joint spatial-spectral information. A dual-scale convolutional module is designed to address the spatial-spectral features and obtain the hierarchical structure description of the dataset. The features are addressed by a bichannel fusion module and are imported into a global average pooling classifier to achieve the classification results.
Although the spectral-spatial CNN has significantly improved the accuracy of HSI classification, there are still some problems to be solved, such as the convergence of deep networks, the limited number of labeled samples, and the extraction of complex land cover objects [37]. To address these issues, scholars strive to optimize existing spectral-spatial CNN frameworks. To solve the problem of deep network convergence, residual blocks and densely connected structures are introduced to improve the CNN frameworks. For instance, Wang et al. [38] propose a fast dense spectral-spatial convolutional framework for HSI classification. Different convolutional kernel sizes are used to extract spectral and spatial features separately. Densely connected structures are used for deep learning of features. Paoletti et al. [39] propose a residual-based CNN approach, grouped in pyramidal bottleneck residual blocks, which involves more locations as the network depth increases while preserving the time complexity per layer. Meanwhile, the multiscale strategy is implemented to construct the CNNs to better use the limited samples and extract features of complex land cover objects. For example, Liu et al. [40] propose a 2-D-3-D CNN with spectral-spatial multiscale feature fusion for HSI classification. The network employs two diverse backbone modules for feature representation. A hierarchical feature extraction module is used to capture multiscale spectral features, and a multilevel fusion structure is used to extract multistage spatial features. In [41], a multiscale self-looping CNN is proposed for HSI classification. Each layer in a self-looping block contains both forward and backward connections, which can efficiently fuse the shallow and deep features extracted by different layers. Furthermore, the dual-branch strategy is introduced to the spectral-spatial CNN framework. Wang et al. [42] propose a dual-branch dense residual network for HSI classification. 
One branch is based on 1-D convolution, which is used to extract spectral features. Another branch is based on 2-D convolution, which is used to extract spatial features. Residual units and dense structures are introduced to fuse the information of different convolutional layers. The experimental results show that the proposed method achieves superior classification performance compared with the state-of-the-art methods. At the same time, attention mechanisms are employed to combine with convolutional layers to make the network more flexible. Li et al. [43] propose a spectral-spatial network with channel and position global context attention (SSGC) for HSI classification. Pan et al. [1] propose a one-shot dense network (OSDN) with polarized attention for HSI classification. The one-shot units are used to maintain the information of different layers, and the polarized attention is used to extract the high internal resolution spectral and spatial information. It is worth noting that the aggressive improvements effectively enhance the performance of spectral-spatial CNN frameworks, and the improvements of spectral-spatial CNNs are not limited to the abovementioned methods.
In this article, we propose a pyramidal multiscale spectral-spatial CNN (PMCN) with polarized attention for pixel-wise HSI classification. The proposed network contains three stages: a channel-wise feature extraction network, a spatial-wise feature extraction network, and a classification network. The channel-wise feature extraction network is used to extract the spectral features of the HSI dataset, and the spatial-wise feature extraction network is used to extract the spatial features. Pyramidal multiscale convolutional blocks and polarized self-attention (PSA) blocks are combined to extract complex spectral and spatial features with high resolution. Batch normalization (BN) [44], parametric rectified linear unit (PReLU) [45], and Mish [46] are implemented to maintain the stability and nonlinearity of the network. Furthermore, residual aggregation and one-shot aggregation are introduced to help the network converge. Finally, the classification network is used to fuse the features and obtain the classification results. The main contributions are summarized as follows.
1) We improve the traditional pyramidal multiscale convolutional block so that it uses pseudo-3-D multiscale spectral convolutions and spatial convolutions to construct spectral feature extraction blocks and spatial feature extraction blocks, respectively. This approach can reduce the complexity of the proposed network without reducing the classification accuracy and makes the network easier to train.
2) The residual aggregation and one-shot aggregation are jointly employed in the proposed network. This approach can effectively maintain the shallow features of the low-level layers so that the network can adequately integrate the features of different layers for better convergence and improved efficiency.
3) The polarized attention mechanism is used to help the multiscale convolutional blocks extract spectral and spatial features. This approach can effectively extract the segments that need attention based on the characteristics of the input feature map and is an attractive complement to standard multiscale convolutional blocks at high internal resolution.
The rest of this article is organized as follows. Section II introduces the related work, such as the cube-based HSI classification framework, pyramidal convolution (PyConv), attention mechanism, and aggregation methods. The details of the proposed network are given in Section III. Section IV lists the experimental results, and Section V provides a discussion. Section VI concludes this article and discusses future work.

II. RELATED WORK

A. Cube-Based HSI Classification Framework
To extract the spectral-spatial features of the HSI, the cube-based method is introduced to pixel-wise HSI classification [47]. In this method, a square HSI data cube centered on the corresponding pixel is cropped and used as the input data of the network. The land cover label of the 3-D cube is determined by its central pixel. To be specific, consider an HSI dataset $X \in \mathbb{R}^{D \times H \times W}$ and the land cover label of the $i$th pixel $y_i \in \{1, 2, \ldots, m\}$, where $D$ is the number of channels (spectral dimensions), $H \times W$ is the spatial size of the HSI dataset, and $m$ is the number of land cover categories. The HSI data cube of the $i$th pixel can be described as $x_i \in \mathbb{R}^{D \times h \times w}$, which is centered on the $i$th pixel in the spatial dimension and has spatial size $h \times w$. In general, we denote the $i$th labeled sample as $(x_i, y_i)$.
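The cropping step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `extract_cube` and the mirror padding at the image border are assumptions, since the text does not say how border pixels are handled.

```python
import numpy as np

def extract_cube(X, row, col, h=15, w=15):
    """Crop a D x h x w cube centered on pixel (row, col) of X (D x H x W)."""
    if h % 2 == 0 or w % 2 == 0:
        raise ValueError("h and w must be odd so that the cube has a center pixel")
    ph, pw = h // 2, w // 2
    # Assumption: mirror-pad the borders so edge pixels also get full-size cubes.
    padded = np.pad(X, ((0, 0), (ph, ph), (pw, pw)), mode="reflect")
    # In padded coordinates, pixel (row, col) sits at (row + ph, col + pw),
    # so the window starting at (row, col) is centered on it.
    return padded[:, row:row + h, col:col + w]

# Toy scene: 4 bands, 6 x 8 pixels; crop a 5 x 5 cube around a corner pixel.
X = np.arange(4 * 6 * 8, dtype=np.float32).reshape(4, 6, 8)
cube = extract_cube(X, row=0, col=7, h=5, w=5)
```

The label of the returned cube is the label of its center pixel, so a training sample is the pair `(cube, y[row, col])` for a hypothetical label map `y`.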

B. Pyramidal Convolution
The PyConv [37] is a multiscale 3-D convolutional network architecture that uses a local multiscale context aggregation module and a global multiscale context aggregation block to parse the input feature map. Different from the standard convolution, PyConv enlarges the receptive field of the kernel and applies different types of kernels with different spatial and spectral resolutions in parallel. The structure of the PyConv is illustrated in Fig. 1. Given the feature map $FM_i \in \mathbb{R}^{C \times h \times w}$, where $C$ is the number of channels and $h \times w$ is the spatial size, PyConv uses different types of 3-D kernels in a pyramid that produces a series of outputs and aggregates the outputs into an output feature map $FM_{out} \in \mathbb{R}^{C \times h \times w}$. In general, the size of the 3-D kernels can vary in two directions: spatial-wise and channel-wise. As can be seen from Fig. 1, the spatial size of the kernels increases from the bottom of the pyramid to the top, and the channel size of the kernels simultaneously decreases. The pyramidal structure provides a pool of combinations with different types and sizes of kernels. The architecture can thus acquire complementary information: the smaller receptive fields focus on small objects, while the larger receptive fields capture larger objects and contextual information.

C. Attention Mechanism
Inspired by the human perception process, the attention mechanism is designed to focus more on the informative areas and to take less account of nonessential areas [48]. It obtains linear weights to represent the contributions to extract features based on the correlations between objects, which can be interpreted as a method of feature transformation. The attention mechanism, which is used to address the weakness of standard convolutions [49], has shown excellent performance in various tasks, such as image categorization, image captioning, text-to-image synthesis, and scene segmentation [50].
Self-attention [51], [52] is a kind of attention model that uses an input tensor to compute the attention weights and reweights the input tensor by these weights. In general, it works as a standard component to capture long-range interactions. As a result, self-attention models are typically inserted after convolutional blocks so that the network can handle both short- and long-range dependencies. In this article, a powerful self-attention for pixel-wise regression, named PSA [53], is introduced to the proposed network. It keeps high internal resolution and fuses SoftMax-sigmoid composition in both channel-only and spatial-only attention blocks. The detailed implementation is described in Section III-C.

D. Residual Aggregation, Dense Aggregation, and One-Shot Aggregation
In general, deep neural networks have a powerful ability to extract abstract information from input datasets that can provide effective support for downstream tasks. However, as the neural network deepens, vanishing/exploding gradients and network degradation often prevent the network from converging. To address these issues, residual aggregation, also known as residual connection or identity mapping, is proposed [54]. As shown in Fig. 2, a skip connection is added to the basic traditional deep neural network. $H$ is the hidden layer that represents several convolutional layers with BN layers and activation layers, and $\oplus$ is a summation operator. The skip connection allows the input feature map to be passed directly to the subsequent layers in a summative way. The output feature map of the $l$th hidden layer, denoted $x_l$, can be expressed as

$$x_l = H_l(x_{l-1}) \oplus x_{l-1}. \tag{1}$$

As mentioned by Zhu et al. [55], information carried by early feature maps would be washed out as it is summed with others. To better maintain the early information, dense aggregation is proposed [56]. Different from residual aggregation, dense aggregation utilizes the concatenation operator to merge the feature maps, which preserves information in its original form. As shown in Fig. 3, all previous feature maps of the early layers can be used to compute the output of the $l$th layer

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $[\cdot]$ denotes concatenation. We can see that if the hidden layer $H_l$ produces $k$ feature maps, the input of $H_{l+1}$ will be $k_0 + k \times l$ feature maps, where $k_0$ is the number of feature maps of the input, while the output of $H_{l+1}$ will still be $k$ feature maps. However, we find in the experiment that networks with dense aggregation consume more energy and time than those with residual aggregation. To make dense aggregation more efficient, one-shot aggregation [57] is proposed, which preserves the benefit of concatenative aggregation for feature extraction. As shown in Fig. 4, one-shot aggregation aggregates intermediate features at once. Experiments show that one-shot aggregation provides great benefits to computation efficiency while preserving the advantage of dense aggregation.
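The channel arithmetic behind the comparison of dense and one-shot aggregation can be made concrete with a small sketch. The helper names and the example values of $k_0$ and $k$ below are hypothetical; the one-shot variant assumes that each hidden layer reads only its predecessor and that the input and all layer outputs are concatenated once at the end.

```python
def dense_input_channels(k0, k, num_layers):
    """Input channel count of each hidden layer under dense aggregation."""
    # Layer l (1-indexed) sees the input (k0 maps) plus the k maps
    # produced by each of the l - 1 earlier layers.
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]

def one_shot_input_channels(k0, k, num_layers):
    """Per-layer input channels and final concatenation width (one-shot)."""
    # Each layer only reads its predecessor; everything is concatenated once.
    per_layer = [k0] + [k] * (num_layers - 1)
    final_concat = k0 + k * num_layers
    return per_layer, final_concat

dense = dense_input_channels(k0=24, k=12, num_layers=3)          # [24, 36, 48]
per_layer, final = one_shot_input_channels(k0=24, k=12, num_layers=3)
# per_layer == [24, 12, 12], final == 60

# Crude cost proxy: total number of input channels the layers must process.
dense_cost, one_shot_cost = sum(dense), sum(per_layer)           # 108 vs 48
```

Under these assumptions the input width of dense layers grows linearly with depth, whereas one-shot aggregation keeps it constant and pays for the concatenation only once.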

III. METHODOLOGY
In this section, we first introduce the framework of the proposed network in detail. Second, channel-wise and spatial-wise pyramidal convolutional blocks are described. Finally, the implementation of PSA blocks is discussed.

A. Framework of the PMCN
The structure of the PMCN is shown in Fig. 5. The proposed network can be divided into three parts: a channel-wise feature extraction network, a spatial-wise feature extraction network, and a classification network. The channel-wise feature extraction network is composed of three channel-wise pyramidal convolutional blocks, one channel-only block of PSA, and four convolutional layers. Residual aggregation and one-shot aggregation are utilized to preserve early information. The spatial-wise feature extraction network is placed after the channel-wise feature extraction network. Similarly, it is composed of three spatial-wise pyramidal convolutional blocks, one spatial-only block of PSA, and two convolutional layers. One-shot aggregation is implemented among the spatial pyramidal convolutional blocks. BN and PReLU are arranged in appropriate locations to maintain the stability and nonlinearity of the network. Finally, the classification network provides the classification result; it contains an average pooling layer, a BN layer, a Mish activation, and a linear layer. The average pooling layer is used to concentrate features from extracted feature maps. The BN layer is applied to stabilize the network and make it easier to converge. The Mish activation function is employed to provide a wider range of values for the output data. The linear layer is implemented to provide final classification results. Assuming the input data are $x_i \in \mathbb{R}^{D \times h \times w}$, where $x_i$ is the cube-based HSI data of the $i$th pixel, $D$ is the number of channels, and $h \times w$ is the spatial size of the data, the output of the network is $y'_i \in \mathbb{R}^{1 \times m}$, where $m$ is the number of land cover categories. To be specific, we take the input $x_i \in \mathbb{R}^{103 \times 15 \times 15}$ as an example to specify the data flow of the network. The detailed steps of the proposed network are shown in Table I. 
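Cross-entropy loss is used to train the proposed network, which in its standard softmax form can be expressed as

$$L_i = -\log \frac{\exp(y'_{i, y_i})}{\sum_{j=1}^{m} \exp(y'_{i, j})}$$

where $y_i$ is the land cover label of the $i$th pixel, $y'_{i,j}$ is the $j$th element of the network output $y'_i$, and $L_i$ is the cross-entropy loss of the $i$th pixel. In addition, early stopping and dynamic learning rate [48] techniques are also implemented to reduce the training time and provide better network convergence.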
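As a small worked example, the per-pixel cross-entropy can be evaluated directly from the $m$ raw outputs of the linear layer. The sketch below assumes the usual softmax form, in which the loss is the negative log of the softmax probability assigned to the true class; the function name is hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one pixel: -log softmax(logits)[label], label in 1..m."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label - 1]

# Uninformative output over m = 3 classes: the loss equals log(3).
loss_uniform = cross_entropy([0.0, 0.0, 0.0], label=2)
# Confident and correct output: the loss is close to zero.
loss_confident = cross_entropy([0.0, 10.0, 0.0], label=2)
# Confident but wrong output: the loss is large (about 10).
loss_wrong = cross_entropy([0.0, 10.0, 0.0], label=1)
```

The training objective is then the mean of these per-pixel losses over a batch.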

B. Channel-Wise and Spatial-Wise Pyramidal Convolutional Blocks
In the proposed network, pyramidal convolutional blocks are introduced to extract multiscale information from feature maps. Different from the traditional PyConv, in which the channel and spatial sizes of the 3-D convolutional kernels vary jointly, we clearly separate the kernels into channel-wise kernels and spatial-wise kernels. The size of the multiscale kernels varies only in the channel or the spatial dimension, which can effectively reduce the computational complexity of the network. As a result, two kinds of pyramidal convolutional blocks are constructed: channel-wise pyramidal convolutional blocks, used in the channel-wise feature extraction network, and spatial-wise pyramidal convolutional blocks, used in the spatial-wise feature extraction network. In addition, instead of segmenting the input data as the traditional PyConv does, we use the complete input data directly for feature extraction to maintain the integrity of the feature maps.
To be specific, the structures of the channel-wise and spatial-wise pyramidal convolutional blocks are illustrated in Figs. 6 and 7. Assuming the input data are $FM_i$, the channel-wise pyramidal convolutional block contains three convolutional layers with (7 × 1 × 1), (5 × 1 × 1), and (3 × 1 × 1) kernels to extract multiscale features. After that, the concatenation operator is applied to merge the features. BN and PReLU are used to provide stability and nonlinearity for the network. Finally, convolutional layers with BN and PReLU are used to reduce the dimension of the feature maps and provide the output ($FM_{out}$). The spatial-wise pyramidal convolutional block contains three convolutional layers with (1 × 7 × 7), (1 × 5 × 5), and (1 × 3 × 3) kernels to extract multiscale spatial features. Similar to the channel-wise pyramidal convolutional block, the concatenation operator is applied to generate the feature maps. After that, the convolutional layer, BN, and PReLU are used to provide the final output.
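The multiscale branch of the channel-wise block can be illustrated with NumPy. This sketch covers only the three parallel spectral convolutions and the concatenation; the BN, PReLU, and dimension-reducing convolutions are omitted. It assumes one filter per scale, a single input feature map, zero "same" padding, and fixed averaging weights, none of which are stated in the text (in the real network the weights are learned).

```python
import numpy as np

def spectral_conv_same(cube, kernel):
    """Correlate a (D, h, w) cube with an odd-length 1-D kernel along the
    spectral axis, i.e., a (k x 1 x 1) kernel with zero 'same' padding."""
    pad = len(kernel) // 2
    padded = np.pad(cube, ((pad, pad), (0, 0), (0, 0)))
    D = cube.shape[0]
    out = np.zeros_like(cube, dtype=np.float64)
    for offset, weight in enumerate(kernel):
        # Shifted copy of the cube weighted by one kernel tap.
        out += weight * padded[offset:offset + D]
    return out

def channel_pyramid(cube, kernels):
    """Apply every kernel to the complete input and stack the branch
    outputs along a new leading feature axis (the concatenation step)."""
    return np.stack([spectral_conv_same(cube, k) for k in kernels], axis=0)

# Hypothetical averaging kernels of spectral length 7, 5, and 3.
kernels = [np.full(k, 1.0 / k) for k in (7, 5, 3)]
cube = np.ones((10, 3, 3))             # D = 10 bands, 3 x 3 spatial window
pyramid = channel_pyramid(cube, kernels)  # shape (3, 10, 3, 3)
```

Each branch sees the full input rather than a channel subset, and only the spectral extent of the kernel changes between branches. The spatial-wise block follows the same pattern with (1 × k × k) kernels.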

C. PSA Blocks: Channel-Only Block and Spatial-Only Block
PSA is a kind of self-attention mechanism designed for high-resolution pixel-wise regression. It can maintain high internal resolution in the computation of the channel and the spatial attention while fully collapsing the input tensors along the corresponding dimensions, and it composes nonlinearity to fit the output distribution of typical fine-grained regression. To be specific, two kinds of PSA blocks are introduced: the channel-only block and the spatial-only block. Given the input feature map $FM_i$, the channel-wise attention weight $A^{ch}(FM_i) \in \mathbb{R}^{C \times 1 \times 1}$ can be expressed as

$$A^{ch}(FM_i) = F_{SG}\left[ W_z\left( \sigma_1(W_v(FM_i)) \times F_{SM}(\sigma_2(W_q(FM_i))) \right) \right]$$

where $W_q$, $W_v$, and $W_z$ are the 1 × 1 convolutional layers, $\sigma_1$ and $\sigma_2$ are the tensor reshape operators, $F_{SM}(\cdot)$ is a SoftMax operator, "×" is the matrix dot-product operation, and $F_{SG}(\cdot)$ is a sigmoid operator. The output of the channel-only block is $FM^{ch}_{out}$ and can be expressed as

$$FM^{ch}_{out} = A^{ch}(FM_i) \odot^{ch} FM_i$$

where $\odot^{ch}$ is a channel-wise multiplication operator. The structure of the channel-only block of PSA is shown in Fig. 8. The structure of the spatial-only block is shown in Fig. 9. The spatial attention weight $A^{sp}(FM_i) \in \mathbb{R}^{1 \times h \times w}$ can be expressed as

$$A^{sp}(FM_i) = F_{SG}\left[ \sigma_3\left( F_{SM}(\sigma_1(F_{GP}(W_q(FM_i)))) \times \sigma_2(W_v(FM_i)) \right) \right]$$

where $W_q$ and $W_v$ are the standard 1 × 1 convolutional layers, $\sigma_1$, $\sigma_2$, and $\sigma_3$ are the tensor reshape operators, and $F_{GP}$ is a global pooling operator. The output of the spatial-only block is $FM^{sp}_{out}$ and can be expressed as

$$FM^{sp}_{out} = A^{sp}(FM_i) \odot^{sp} FM_i$$

where $\odot^{sp}$ is a spatial-wise multiplication operator.
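The two attention computations can be traced step by step with NumPy. The sketch below is illustrative only: 1 × 1 convolutions are written as per-pixel matrix products, the intermediate width $C/2$ follows the original PSA design rather than anything stated in this section, global pooling is taken to be average pooling, and any normalization layers are omitted. All weights are random placeholders.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conv1x1(fm, weight):
    """1x1 convolution: weight (C_out, C_in) applied at every pixel of fm."""
    return np.einsum("oc,chw->ohw", weight, fm)

def channel_only(fm, Wq, Wv, Wz):
    C, h, w = fm.shape
    q = softmax(conv1x1(fm, Wq).reshape(h * w))   # (hw,): weights over pixels
    v = conv1x1(fm, Wv).reshape(-1, h * w)        # (C/2, hw)
    z = Wz @ (v @ q)                              # (C,): collapse space, expand to C
    attn = sigmoid(z).reshape(C, 1, 1)
    return attn * fm, attn                        # channel-wise reweighting

def spatial_only(fm, Wq, Wv):
    C, h, w = fm.shape
    q = softmax(conv1x1(fm, Wq).mean(axis=(1, 2)))  # (C/2,): assumed average pooling
    v = conv1x1(fm, Wv).reshape(-1, h * w)          # (C/2, hw)
    attn = sigmoid(q @ v).reshape(1, h, w)          # collapse channels, keep space
    return attn * fm, attn                          # spatial-wise reweighting

rng = np.random.default_rng(0)
C, h, w = 8, 5, 5
fm = rng.standard_normal((C, h, w))
out_ch, a_ch = channel_only(fm, rng.standard_normal((1, C)),
                            rng.standard_normal((C // 2, C)),
                            rng.standard_normal((C, C // 2)))
out_sp, a_sp = spatial_only(fm, rng.standard_normal((C // 2, C)),
                            rng.standard_normal((C // 2, C)))
```

In both branches the SoftMax is applied to the query side and the sigmoid to the final weights, which is the SoftMax-sigmoid composition mentioned in Section II-C; the output always has the same shape as the input feature map.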

A. Hyperspectral Dataset Description
In the experiment, five well-known HSI datasets with different land covers and resolutions are used to evaluate the effectiveness of the proposed network, including the University of Pavia dataset (UP), the WHU-Hi-HongHu dataset (HH) [58], the Forest Farm of Gaofeng dataset (GF) [59], the GF-5 advanced HSI dataset (AH) [60], and the Houston University dataset (HU) [61]. The brief views of the five HSIs are described as follows.
1) University of Pavia Dataset: The UP dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Pavia, Italy, in 2003. The spatial size is 610×340, and the spatial resolution is about 1.3 m per pixel. After dropping 12 noise-contaminated spectral bands, the UP dataset contains 103 bands with a spectral wavelength ranging from 430 to 860 nm. About 21% of pixels are labeled into nine categories, including asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows. We randomly select 1% of labeled samples as training samples and validation samples, respectively. The remaining labeled samples are used as testing samples. The detailed classes, colors, and the number of samples of the UP dataset are shown in Table II.
2) WHU-Hi-Honghu Dataset: The HH dataset was acquired by an unmanned aerial vehicle (UAV) platform over a complex agricultural area in Honghu City, Hubei Province, China. The spatial size is 940 × 475. The spatial resolution is about 0.043 m per pixel. It contains 270 spectral bands ranging from 400 to 1000 nm. An intercepted area with 16 categories is used in our experiment, including Red roof, Road, Bare soil, Cotton, Rape, Chinese cabbage, Pakchoi, Cabbage, Tuber mustard, Brassica parachinensis, Small brassica chinensis, Lactuca sativa, Celtuce, Romaine lettuce, White radish, and Garlic sprout. The spatial size of the intercepted area is 240 × 330, covering rows (701, 940) and columns (1, 330). We randomly select 1% of labeled samples as training samples and validation samples, respectively. The remaining labeled samples are used as testing samples. The detailed information is displayed in Table III.
3) Forest Farm of Gaofeng Dataset: The GF dataset was acquired by the AISA Eagle II diffraction grating pushbroom hyperspectral imager in 2018 over the Jiepai branch of Gaofeng State Owned Forest Farm, Nanning, Guangxi Province, China. The spatial size is 572 × 906. The spatial resolution is about 1.0 m per pixel. The dataset covers the spectral range of 400-1000 nm with 125 bands. An intercepted area with eight categories is used in our experiment, including Cunninghamia lanceolata, Pinus massoniana, Pinus elliottii, Eucalyptus urophylla, Mytilaria laosensis, Camellia oleifera, Road, and Cutting bland. The spatial size of the intercepted area is 400 × 400, covering rows (1, 400) and columns (1, 400). We randomly select 1% of labeled samples as training samples and validation samples, respectively. The remaining labeled samples are used as testing samples. The detailed information is displayed in Table IV.
4) GF-5 Advanced HSI Dataset: The AH dataset was obtained by the GF-5 satellite over the Jiangxia District, Wuhan City, Hubei Province, and covers an area of 109.4 km². It is a mixed landscape with mining and agriculture areas, and the types of surface objects are complex. The spatial size is 218 × 561. The spatial resolution is about 30 m. Its spectral range extends from 400 to 2500 nm with 120 bands. The land covers are classified into six categories, including Surface-mined area, Road, Water, Crop land, Forest land, and Construction land. We randomly select 5% of labeled samples
To be specific, the SVM with radial basis function (RBF) kernel is employed as a representative of the traditional method for HSI classification. The HYSN is employed as a representative of the traditional convolutional network. The SSRN is used to represent the traditional convolutional network with residual aggregation. The EMFFN is accepted to represent the multiscale convolutional network. The DBMA and DBDA represent the two-branch convolutional network with attention blocks. The PCIA is employed to represent the pyramidal multiscale convolutional network with attention blocks. The SSGC and OSDN are used to represent the state-of-the-art convolutional network. The competitors are described in detail as follows.
1) SVM: The SVM with RBF kernel is employed in the experiment. The raw spectral vectors of the pixels are fed into the SVM as the input data. The penalty parameter $C$ and the RBF kernel width $\sigma$ of the SVM are selected by GridSearchCV, both in the range of ($10^{-2}$, $10^{2}$).

2) HYSN: The HYSN is a spectral-spatial 3-D-CNN followed by spatial 2-D-CNN. Three multiscale 3-D convolutional layers with 7 × 3 × 3, 5 × 3 × 3, and 3 × 3 × 3 kernels are used in the method to extract joint spectral-spatial features. One 2-D convolutional layer with a 3 × 3 kernel is used to learn more abstract level spatial features. Two fully connected layers are implemented after 3-D and 2-D layers to provide the final classification results.
3) SSRN: In the SSRN, the spectral and spatial residual blocks are introduced to learn discriminative features from spectral signatures and spatial contexts in HSI. Two kinds of 3-D kernels with 7 × 1 × 1 and 1 × 3 × 3 window sizes are used in the network to extract spectral information and spatial information, respectively. BN and rectified linear unit (ReLU) operators are added after each convolutional layer.
4) EMFFN: The EMFFN is an enhanced multiscale feature fusion network, which consists of two networks named spectral cascaded dilated convolutional network (CDCN) and parallel multipath network (PMN). The features collected from the two subnetworks are combined into EMFFN using the designed consolidated loss function.
In the CDCN, four dilated 2-D convolutional layers with kernel size 6 × 1 are used to extract the spectral information. The dilation rate d = 2 i (i = 0, 1, 2, 3) is designed for the blocks. A channel attention module is implemented after the dilated convolutional layers to further extract the long-range information. In the PMN, the input data are downscaled to 5-D by PCA. Multiscale 2-D convolutional layers with 7 × 7, 5 × 5, and 3 × 3 kernels are introduced to extract multilevel spatial information. Three parallel paths are used to fuse the multiscale features to leverage both shallow and deep features. 5) DBMA: The DBMA is a double-branch multiattention mechanism network for HSI classification. Two branches networks are used to extract spectral and spatial features, respectively. Two types of attention mechanisms are applied in the two branches. The sizes of the 3-D kernels  For all the competitive networks, the spatial size of the HSI patch cube is set to 11 × 11. The batch size is set to 32. The epoch is set to 200, and the initial learning rate is set to 0.0005. The Adam optimizer is adopted with an attenuation rate of (0.9, 0.999) and a fuzzy factor of 10 −8 . The learning rate is dynamically adjusted every 15 epochs by cosine annealing [64]. Moreover, the early stopping technique is employed in the training process. If the loss on the validation dataset does not change within 20 epochs, the training process will move to the test session. Furthermore, the dropout technique with 0.5 probability is applied to enhance the generalization capability of the model. To quantitatively measure the performance of the competitors, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) are implemented in the experiments. All experiments are repeated five times independently. The average values of the experimental results are reported as the final results. 
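The three evaluation metrics can be computed from a confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `classification_scores` and the toy matrix are hypothetical, and the matrix is assumed to hold true classes in rows and predicted classes in columns.

```python
def classification_scores(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    confusion[i][j] is the number of test pixels whose true class is i
    and whose predicted class is j (assumed layout).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    # Overall accuracy: fraction of all test pixels classified correctly.
    correct = sum(confusion[i][i] for i in range(n_classes))
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (recalls).
    # Classes without test samples are skipped.
    recalls = [confusion[i][i] / row_totals[i]
               for i in range(n_classes) if row_totals[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = 1.0 if expected == 1.0 else (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


# Hypothetical two-class example (not data from the paper).
toy = [[8, 2],
       [1, 9]]
oa, aa, kappa = classification_scores(toy)
```

For the toy matrix, 17 of 20 pixels are correct, so OA is 0.85; the chance agreement is 0.5, which gives a Kappa of 0.7.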
The experimental hardware environment is a deep learning workstation with a 2.4-GHz Intel Xeon E5-2680v4 processor and an NVIDIA GeForce RTX 2080Ti GPU. The software environment is CUDA v11.2, PyTorch 1.10, and Python 3.8.
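The learning-rate schedule and the early-stopping rule of the training setup can be sketched as follows. This is an illustration in plain Python rather than the authors' implementation: the names `cosine_annealed_lr` and `EarlyStopping` are hypothetical, the schedule is assumed to restart every 15 epochs with a minimum learning rate of 0, and "does not change" is interpreted as "does not improve on the best validation loss so far".

```python
import math


def cosine_annealed_lr(epoch, base_lr=5e-4, period=15, min_lr=0.0):
    """Cosine-annealed learning rate that restarts every `period` epochs."""
    t = epoch % period  # position inside the current cycle (assumed restart)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / period))


class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Usage with hypothetical validation losses and a short patience for illustration.
stopper = EarlyStopping(patience=3)
stop_epoch = None
for epoch, loss in enumerate([1.0, 0.9, 0.9, 0.9, 0.9, 0.9]):
    lr = cosine_annealed_lr(epoch)
    if stopper.step(loss):
        stop_epoch = epoch
        break
```

In a PyTorch training loop, the same behavior would normally come from a built-in scheduler attached to the Adam optimizer; the plain-Python form is used here only to make the rule explicit.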

C. Experimental Results
We first assess the performance and training time of the various methods on the UP dataset. The classification results are given in Table VII. The best OA, AA, and Kappa and the longest training time are highlighted in bold. We can see that the proposed PMCN achieves competitive per-category accuracies, OA, AA, and Kappa in most cases. In terms of OA, PMCN exceeds SVM, HYSN, SSRN, EMFFN, DBMA, DBDA, PCIA, SSGC, and OSDN by 9.07%, 9.67%, 13.85%, 11.74%, 1.87%, 2.55%, 2.45%, 1.76%, and 0.15%, respectively. This is because we use the pyramidal multiscale convolutional blocks and PSA blocks to jointly extract spectral and spatial information. Furthermore, we use residual aggregation and one-shot aggregation to maintain the multilevel features of the network, which allows the network to be designed deeper. The OA of SVM is lower than that of every deep convolutional network except HYSN, SSRN, and EMFFN. This is because convolutional networks implicitly use the spatial information of the pixels and can be considered spatial-spectral-based classification methods. By obtaining more available information on pixels, deep convolutional networks can achieve better classification results than SVM. Comparing the deep convolutional networks, we can see that HYSN, SSRN, and EMFFN provide lower OAs than the other networks. It indicates that effective extraction of discriminative spectral and spatial features of the UP dataset is difficult for traditional 3-D and 2-D CNNs. The two-branch networks (DBMA and DBDA) outperform the traditional deep convolutional networks (HYSN, SSRN, and EMFFN). The pyramidal multiscale network (PCIA) provides an OA of 95.43%, which is higher than that of DBDA and lower than that of DBMA.
Moreover, the networks that combine more techniques, such as two-branch structure, multiscale convolution, attention mechanism, dense aggregation, and one-shot aggregation (SSGC, OSDN, and PMCN), achieve better results than the former networks. The SSRN and PMCN show a higher standard deviation (SD) of OA than the other methods, which indicates that they are less robust. PMCN requires the longest training time (75.40 s), which is discussed in Section V-C. The full-factor classification maps of the competitors are shown in Fig. 10. We can see that salt-and-pepper noise appears in the classification map of SVM. In contrast, the classification maps of the convolutional networks are smooth. It shows that convolutional networks can improve the smoothness of the classification maps by extracting spatial features of HSI datasets.
To further evaluate the performance of the proposed method, experiments are conducted on a high spatial resolution HSI dataset, the HH dataset (0.043 m per pixel). From Table VIII, we can see that the spectral-based classification method (SVM) achieves the second-lowest OA (80.4%), above only EMFFN. It indicates that it is difficult to classify the land cover objects using only spectral signatures on the HH dataset. EMFFN obtains the lowest OA of 78.02%. HYSN and SSRN achieve higher OAs (88.01%, 85.70%) than SVM and EMFFN. Observing the classification accuracy of various categories, we can see that some categories are still hard for SVM, HYSN, SSRN, and EMFFN to classify, such as C2, C4, C6, C9, C10, C11, C13, C14, C15, and C16. In particular, SVM (26.04%), SSRN (0.00%), and EMFFN (29.95%) largely fail to classify C9. In contrast, DBMA and DBDA obtain better classification accuracies (95.64%, 94.90%) than those of the former methods. The PCIA consistently achieves competitive results (95.06%), which indicates

The GF dataset covers a forest farm and is used for forestry tree species classification. The spectral responses of different plants of the same family and genus are very close to each other, which tends to degrade the classification results of most existing spectral-based methods. As shown in Table IX, the OA of SVM is 76.26%. For some specific classes, such as C1, C2, C3, and C5, the accuracy is less than 50%. HYSN, SSRN, and EMFFN provide better classification accuracies than SVM. However, the accuracies of C1 (75.17%, 54.71%, 26.06%) and C3 (73.08%, 66.40%, 42.29%) are still insufficient. Conversely, DBMA, DBDA, PCIA, SSGC, OSDN, and PMCN provide satisfactory classification accuracies, especially for C1, C3, and C5. PMCN achieves competitive results in most cases. The full-factor classification maps are shown in Fig. 12, and the classification map by the PMCN is almost the same as the ground truth.
Furthermore, the AH dataset is used to evaluate the performance of the methods. It is a satellite dataset covering mining and agricultural areas. In particular, the labeled samples of the AH dataset are disjointly marked, which makes it challenging to effectively extract the spatial features of a pixel. As shown in Table X, the spatial-spectral-based deep convolutional networks (HYSN, SSRN, EMFFN, DBMA, DBDA, PCIA, SSGC, OSDN, and PMCN) achieve only limited improvements over the spectral-based method (SVM), ranging from 0.88% to 7.58%. The reason is that the disjointly marked samples restrict the ability of the cube-based approach to extract spatial information. Under the condition of restricted spatial information, the discrimination capability of convolutional networks cannot be sufficiently exploited. Benefiting from the multiscale property of PyConv, PMCN obtains the highest classification accuracy (80.73%) among the competitors. The full-factor classification maps for the AH dataset are shown in Fig. 13. We can see that PMCN yields a finer-grained classification map than those of DBMA, DBDA, PCIA, SSGC, and OSDN. This may be due to the ability of polarized attention blocks to extract detailed spatial and spectral features of pixels.
Finally, the HU dataset is employed to evaluate the performance of the methods under the limited labeled sample condition. In the experiment, the number of training samples of different categories ranges from 3 to 12. It is difficult to learn discriminative information effectively from such a small number of training samples. As shown in Table XI improve the OAs significantly ranging from 3.74% to 8.39%. From Fig. 14, we can see that the SVM, HYSN, SSRN, and EMFFN produce finer-grained classification maps than those of the other methods. It shows that these methods rely mainly on spectral signatures to classify the pixels. In contrast, the DBMA, DBDA, PCIA, SSGC, OSDN, and PMCN make more use of spatial contextual information to extract discriminative features and obtain spatially smoother classification maps.

A. Comparison of Different Spatial Patch Sizes
In this section, we focus on the patch size, which is a hyperparameter of the cube-based convolutional   by providing insufficient or excessive spatial information. As shown in Table XII, we report the OAs for spatial patch sizes ranging from 7 × 7 to 15 × 15 in steps of 2 pixels. We can see that the classification accuracies vary with the patch sizes. As expected, the best OA is acquired when the patch size is 11 × 11 in the UP, HH, GF, and HU datasets. The classification maps of UP with different patch sizes are shown in Fig. 15 as an example. However, the best OA is acquired when the patch size is 7 × 7 in the AH dataset. This is understandable because the labeled samples in the AH dataset are disjointly marked, which is different from the other datasets. The spatial neighborhood area of pixels is restricted in the AH dataset. As a result, a larger patch size cannot effectively provide more spatial information for the network, but rather impairs the discriminability of the pixels. In practice, we recommend using smaller patch sizes for datasets that provide disjoint labeled samples. In our experiment, we consistently choose 11 × 11 as the value of the patch size for the four datasets to keep consistency.
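A cube-based network classifies each pixel from the spatial patch centered on it, so the patch size directly controls how much neighborhood information is seen. The following plain-Python sketch shows one way to cut such a patch; `extract_patch` is a hypothetical helper, and filling positions outside the image with a constant is an assumption, since the padding strategy is not stated in this section.

```python
def extract_patch(image, row, col, patch_size=11, fill=0):
    """Return the patch_size x patch_size neighborhood centered on (row, col).

    `image` is a 2-D list indexed as image[row][col]; each entry may be a
    scalar or a spectral vector. Positions outside the image are set to
    `fill` (assumed padding strategy).
    """
    if patch_size % 2 == 0:
        raise ValueError("patch_size must be odd so that the patch has a center pixel")
    half = patch_size // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(image[r][c] if inside else fill)
        patch.append(patch_row)
    return patch


# Hypothetical 3 x 3 single-band image used only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
corner = extract_patch(image, 0, 0, patch_size=3)
center = extract_patch(image, 1, 1, patch_size=3)
```

Sweeping `patch_size` over 7, 9, 11, 13, and 15 reproduces the kind of comparison reported in Table XII; for disjointly labeled data, a larger window mostly adds pixels that carry no usable context.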

B. Comparison of Different Training Sample Proportions
In this section, we discuss the performance of the competitors under different training sample proportions on the five HSI datasets. This analysis is important because supervised learning methods are data-driven, and the percentage (number) of training samples plays a leading role in the learning process of the models. To comprehensively analyze the performance of the proposed PMCN under different training sample proportions, we randomly select 0.5%, 1%, 1.5%, 2%, 3%, 4%, and 5% of the labeled samples for the UP, HH, GF, and HU datasets, and 5%, 6%, 7%, 8%, 9%, and 10% of the labeled samples for the AH dataset as the training samples. In general, a larger proportion of training samples provides more discriminative information for data-driven classification methods and thus improves their classification accuracy. The classification results are reported in Fig. 16. As expected, the classification accuracies of the methods increase with the training sample proportion. With the smallest percentage of training samples (0.5% for the UP, HH, GF, and HU datasets and 5% for the AH dataset), the classification accuracies of the competitors are correspondingly reduced. Comparing the classification methods, the accuracies of SSGC, OSDN, and PMCN decrease less than those of the other methods. It indicates that these methods are more capable of extracting discriminative features with limited labeled samples. PMCN achieves consistently competitive results as the training sample proportion increases. Specifically, we can see in Fig. 16(d) and (e) that PMCN obtains higher classification accuracies than the other methods on the AH and HU datasets. It indicates that PMCN is the most capable of extracting discriminative features under the condition of limited spatial context information and limited training samples.
The experimental results again demonstrate the utility and effectiveness of combining the pyramidal multiscale convolutional block and the polarized attention block for HSI classification tasks and offer guidance for researchers designing high-performance networks.
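The proportional sampling used in this comparison can be sketched in plain Python as follows. The sketch assumes that the proportion is applied per class (stratified sampling) and that every class keeps at least one training sample; the text only states that samples are selected randomly, so both points are assumptions, and `stratified_split` is a hypothetical name. A separate validation subset is omitted for brevity.

```python
import random
from collections import defaultdict


def stratified_split(labels, proportion, seed=0, min_per_class=1):
    """Split sample indices into training and test sets, class by class.

    labels      : class label of each labeled pixel
    proportion  : fraction of each class used for training, e.g. 0.005 for 0.5%
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = max(min_per_class, round(proportion * len(indices)))
        n_train = min(n_train, len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Hypothetical label vector: 200 pixels of class 0 and 100 pixels of class 1.
labels = [0] * 200 + [1] * 100
train_idx, test_idx = stratified_split(labels, proportion=0.05)
```

Repeating the split with different seeds gives the independent runs over which the reported accuracies are averaged.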

C. Comparison of Computational Cost and Complexity
In the following, we discuss the computational cost and complexity of the proposed PMCN. Table XIII shows    We can see that the values of parameters and FLOPs vary with the size of the datasets and methods. In general, a larger dataset size leads to larger values of parameters and FLOPs. Checking the values of parameters, HYSN contains the highest number of parameters. This is because HYSN uses cascading stacked 3-D convolutional layers to jointly extract the spatial and spectral features. The EMFFN provides the second-highest number of parameters, which is due to the multiscale convolutional layers of the network. SSRN, DBMA, DBDA, PCIA, and SSGC contain a similar number of parameters, which is significantly lower than that of HYSN and EMFFN. This is because these methods replace the traditional cascading stacked 3-D convolutional layers with specialized 3-D convolutional blocks and divide the feature extraction module into separate spatial and spectral branches. PMCN and OSDN contain fewer parameters than the former methods. They benefit from the use of lightweight feature extraction modules and the one-shot aggregation mechanism, which enables the extracted features to be finely fused in the convolutional networks. PMCN contains more parameters than OSDN due to its pyramidal multiscale convolutional blocks.
Observing the FLOPs of the methods, PMCN obtains the highest value of FLOPs. This is because PMCN processes the raw input data without reducing the dimensions. Consequently, a dimension reduction algorithm could be applied to the raw dataset to reduce the FLOPs of PMCN. In addition, the multiscale pyramid blocks also increase the FLOPs of PMCN. HYSN obtains higher FLOPs than the other methods, except for PMCN. SSRN, DBMA, DBDA, PCIA, and SSGC obtain similar FLOPs. As expected, OSDN obtains lower FLOPs than the other methods except for EMFFN. EMFFN obtains the lowest FLOPs in most cases because fewer convolutional layers are used in the framework.
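The difference between parameter count and FLOPs can be made concrete with the standard cost formulas for a 3-D convolutional layer. The sketch below is illustrative only: `conv3d_cost` is a hypothetical helper, the layer sizes are invented, and the cost is counted in multiply-accumulate operations, which is one common convention (the paper does not state how its FLOPs were counted).

```python
def conv3d_cost(in_channels, out_channels, kernel, out_shape, bias=True):
    """Return (parameters, multiply-accumulates) of one 3-D convolutional layer.

    kernel    : (depth, height, width) of the filter
    out_shape : (depth, height, width) of the output feature map
    """
    k_d, k_h, k_w = kernel
    o_d, o_h, o_w = out_shape
    weights = in_channels * out_channels * k_d * k_h * k_w
    params = weights + (out_channels if bias else 0)
    # Every output position of every output channel applies one full filter.
    macs = weights * o_d * o_h * o_w
    return params, macs


# Hypothetical spectral layer with a 7 x 1 x 1 kernel on an 11 x 11 patch,
# once with 100 spectral positions and once after reduction to 30.
params_raw, macs_raw = conv3d_cost(1, 24, (7, 1, 1), (100, 11, 11))
params_red, macs_red = conv3d_cost(1, 24, (7, 1, 1), (30, 11, 11))
```

The parameter count is identical in both cases, while the operation count scales with the spectral depth of the feature map. This is consistent with the observation that a network can have few parameters and still incur high FLOPs when it processes the raw spectral dimension.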

D. Ablation Analyses
In this section, we design four ablation experiments to analyze the effectiveness of the techniques applied in the proposed network: the attention mechanism, the one-shot aggregation, the PyConv, and the Mish activation function.

First, we perform an ablation experiment on the effectiveness of the attention mechanism. In the PMCN, two polarized attention blocks are implemented: the channel-only attention block and the spatial-only attention block. The OAs are shown in Fig. 17(a). Model 1 denotes that no attention mechanism is applied in the PMCN. Model 2 denotes that only the channel-only PSA block is applied in the PMCN. Model 3 denotes that only the spatial-only PSA block is applied in the PMCN. Model 4 denotes that both channel-only and spatial-only blocks are used in the PMCN. Taking the UP dataset as an example, the baseline OA of PMCN is 92.64% when polarized attention blocks are not applied. Used alone, the channel-only attention block and the spatial-only attention block improve the OA over the baseline by 0.88% and 3.99%, respectively. Comparing the improvements across the five HSI datasets, we find that the gain from each block varies from dataset to dataset. It shows that the benefit of the channel-only and spatial-only attention blocks depends on the characteristics of the dataset. Finally, as expected, PMCN obtains the highest classification accuracy by using both channel-only and spatial-only PSA blocks.

Second, we perform an ablation experiment on the effectiveness of the one-shot aggregation. The OAs are shown in Fig. 17(b). Model 1 denotes that one-shot aggregation is not applied in the PMCN, while Model 2 denotes that one-shot aggregation is applied. We can see that one-shot aggregation improves the classification accuracy by 0.54% to 3.63%. The experimental results demonstrate the effectiveness of one-shot aggregation.

Third, we conduct an ablation experiment on the effectiveness of the PyConv. The OAs are shown in Fig. 17(c). Model 1 denotes that only single-scale convolutional layers with 5 × 1 × 1 and 1 × 5 × 5 window sizes are applied in the two branches of PMCN. Model 2 denotes that pyramidal convolutional blocks are applied in the PMCN. We can see a significant improvement in classification accuracy when using PyConv, especially on the AH dataset (from 63.35% to 80.73%). It indicates that PyConv is superior to single-scale convolution in its ability to extract discriminative features from spectral signatures and spatial context information.

Finally, we conduct an ablation experiment on the effectiveness of the Mish activation function. The classification results are shown in Fig. 17(d). Model 1 denotes that PReLU is applied in the classification network of PMCN. Model 2 denotes that Mish is applied. We can see consistent improvements in classification accuracy on the five HSI datasets, which range from 0.31% to 3.39%. The results confirm the validity of the Mish activation function.
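For reference, the two activation functions compared in the last ablation can be written in a few lines. The sketch below uses plain Python scalars for clarity; Mish is defined as x · tanh(softplus(x)), and the PReLU slope of 0.25 is only an assumed starting value, since in a trained network this slope is a learned parameter.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def prelu(x, slope=0.25):
    """PReLU activation; `slope` scales negative inputs (assumed value)."""
    return x if x >= 0 else slope * x


# Compare both activations on a few sample inputs.
samples = [-2.0, -0.5, 0.0, 1.0]
mish_values = [mish(x) for x in samples]
prelu_values = [prelu(x) for x in samples]
```

Unlike PReLU, Mish is smooth everywhere and lets small negative inputs pass with a small negative output that fades for large negative values, rather than scaling all negative inputs linearly.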

VI. CONCLUSION
In this article, a pyramidal multiscale convolutional network with PSA is proposed for pixel-wise HSI classification. The proposed PMCN mainly contains three stages: channel-wise feature extraction network, spatial-wise feature extraction network, and classification network. Pyramidal convolutional blocks and polarized attention blocks are combined to extract spectral and spatial features. The pyramidal convolutional blocks are used to extract multiscale features, and the polarized attention blocks are used to provide more flexibility. Compared to the previous attention mechanisms used in HSI classification methods, polarized attention can better process HSI with high internal resolution. Furthermore, residual aggregation and one-shot aggregation are employed to fuse feature maps of different layers. Finally, a classification network is used to obtain the classification results. Five different HSI datasets are used to evaluate the performance of the proposed PMCN. Nine representative methods are employed for comparison. The experimental results show that the proposed method provides competitive performance among the related methods. In addition, the spatial patch size, training sample proportion, computational cost, and ablation analyses are discussed. In the future, we will combine the PSA mechanism with other convolutional networks and apply these models to other HSI datasets.