
Local-Global Feature Extraction Network With Dynamic 3-D Convolution and Residual Attention Transformer for Hyperspectral Image Classification



Abstract:

Currently, convolutional neural network (CNN) and transformer-based hyperspectral image (HSI) classification methods have attracted significant attention owing to their effective feature representation capabilities. However, methods based on CNN pay insufficient attention to valuable pixels in 3-D HSI samples and cannot adapt to variations in these samples. Transformer-based methods also suffer from high computational complexity and a tendency for low-level spatial-spectral features of the shallow attention layers to vanish as the number of attention layers increases. To address these issues, we propose a local–global feature extraction network with dynamic 3-D convolution and residual attention transformer (LGDRNet). LGDRNet primarily consists of multiscale 3-D conv, dynamic local feature extraction, residual global feature extraction, and feature fusion modules. Specifically, a multiscale 3-D conv module is used for low-level multiscale spectral information extraction. Then, the dynamic local feature extraction module utilizes dynamic 3-D convolution, which can adapt to different samples. This allows the network to focus on valuable pixels in 3-D samples. The residual global feature extraction module utilizes a convolutional projection unit and convolutional multihead self-attention to reduce computational complexity. It employs a residual attention connection to enable the network to effectively transmit and accumulate attention information across consecutive multihead attention layers. This prevents the vanishing of shallow spatial-spectral features. Finally, local and global HSI information can be efficiently integrated using the feature fusion module, which also improves performance during subsequent classification. The proposed model achieves overall classification accuracies of 89.24%, 92.01%, and 94.53% on three benchmark datasets, respectively, outperforming state-of-the-art approaches with limited training samples.
Page(s): 9986 - 10001
Date of Publication: 01 April 2025

SECTION I.

Introduction

In recent years, advancements in remote sensing technology have allowed hyperspectral sensors to collect increasingly precise information [1]. The resulting hyperspectral images (HSIs) contain more spectral bands and rich 2-D spatial information relating to the studied objects [2], [3]. As a result, they have significant applications in fields such as biomedical research, forest monitoring, agriculture, military surveillance, and mineral exploration [4], [5], [6], [7], [8]. To fully leverage the potential of HSI analysis, HSI classification has become an important focus in the development of HSI processing techniques.

HSI classification involves the production of a label prediction for every pixel [9]. Traditional HSI classification methods include k-nearest neighbors [10], logistic regression [11], [12], maximum a posteriori estimator [13], and support vector machine [14]. Soomro et al. [15] designed an effective technique to combine the pathwise coordinate descent algorithm with variable splitting and augmented Lagrange methods, employing the true labels of training samples as additional constraints to improve classification accuracy. These methods can achieve satisfactory results in spectral feature extraction. However, classification effectiveness is severely limited by the excessive reliance of these approaches on manual feature design and their failure to consider the spatial connections between pixels, which results in inadequate use of spatial information.

The progress of HSI classification has been greatly advanced by the introduction of deep learning (DL) [16]. The exceptional performance of convolutional neural networks (CNNs) has made DL a common choice for HSI classification tasks [17], [18], [19], [20], [21]. Song et al. [22] developed a novel CNN capable of extracting more discriminative information from HSI. He et al. [23] integrated multiscale covariance graphs into the CNN method, in order to allow the network to concurrently utilize the spatial and spectral information inherent to HSI. Ullah et al. [24] introduced a deep smooth wavelet CNN approach for HSI classification that uses the wavelet transform to extract spectral features of HSI and a 2-D CNN to obtain spatial features of HSI, resulting in spatial-spectral features suitable for HSI classification. The majority of 2-D-CNN-based networks are capable of improving HSI classification performance. However, they are unable to properly utilize the local spectral information within HSI. Roy et al. [25] designed a multiscale dense network by aggregating features from different levels and scales. Zhong et al. [26] learned discriminative spatial-spectral features using spectral and spatial 3-D-CNN residual blocks. Xi et al. [27] proposed a multidirectional network based on a 3-D-CNN using squeeze and excitation mechanisms to learn spatial-spectral features from multidirectional samples. Wang et al. [28] designed a global spatial-spectral learning framework that utilizes 3-D-CNN with long short-term memory networks in order to learn the links between spectral features. Despite the ability of 3-D-CNN-based networks to extract local spatial-spectral information, their parameter costs remain a concern. In addition, due to the limited receptive field of convolutional kernels, 3-D-CNN excels in capturing local spatial-spectral features, but performs poorly in terms of global feature extraction and long-range information dependence.

The transformer provides a groundbreaking technology with innovative approaches for DL tasks. Vision Transformer (ViT) was first applied to computer vision tasks by Dosovitskiy et al. [29], achieving better performance than CNNs. It can fully utilize global information and capture long-range dependency relationships, and has thus attracted widespread attention in the field of image processing. Mohamed et al. [30] utilized a spatial-spectral transformer to comprehensively capture the relationships between features within HSI data cubes, designing a masking mechanism for training spatial-spectral transformers. Hong et al. [31] proposed the SpectralFormer network, capable of more effectively learning the local spectral sequence information of adjacent bands. Sun et al. [32] proposed the MASSFormer network, which can convert spatial-spectral information into prior knowledge as a memory token for multihead attention. However, despite many advancements, transformer-based methods continue to face the challenge of high computational complexity. In addition, transformer-based methods lack inductive bias and the ability to capture local information effectively. This results in their inability to fully utilize the spatial information contained within HSI.

Consequently, combining CNN and transformer methods to utilize their respective strengths has become a heavily researched topic. The resulting models have been demonstrated to offer higher performance, faster training speeds, and a reduction in the number of parameters [33], [34], [35], [36], [37], [38]. Huang et al. [39] proposed the APSFFT network, which employs an adaptive cross-token attention fusion encoder to integrate information extracted by the CNN, transformer, and GCN methods, thereby enhancing the interaction capabilities between pixel- and superpixel-level information. Sun et al. [40] proposed the SSFTT model, which extracts spatial-spectral information using 2-D-CNN, 3-D-CNN, and transformer. Zhang et al. [41] introduced LGGNet, which incorporates prompting techniques into HSI classification, integrating 3-D-CNN and transformer methods for HSI classification. Sun et al. [42] utilized a multiscale hybrid CNN approach to extract local spatial-spectral information from HSI, thereby creating a DConvformer by merging mathematical morphology with a classic transformer encoder to aggregate previously captured multiscale information.

Although these CNN and Transformer hybrid approaches have shown good performance, some issues remain. LGGNet [41], inspired by Gaussian learnable position encoding, utilizes prior knowledge to initialize trainable parameters in order to collect the spatial values of various pixel points within the HSI sample, but it overlooks variations in the input samples, leading to an inability of the network to adapt to different input samples. Furthermore, as the depth of the attention layers increases, the deep attention layer tends to overlook spatial-spectral features from the shallow attention network; however, these features are also very important for subsequent classification. In addition, reducing computational complexity is also worthy of investigation. In order to solve these issues, this study proposes a local–global feature extraction network with dynamic 3-D convolution and residual attention transformer (LGDRNet) for HSI classification. The following is a summary of the primary contributions made by this study.

1) In HSI classification, a dynamic local feature extraction module is proposed for the first time. This module employs 3-D dynamic convolution to generate different weight matrices for different 3-D sample inputs, enhancing the network's focus on valuable pixels. Compared to 2-D dynamic convolution methods, our approach can perform local computations simultaneously in both spatial and spectral dimensions without disrupting the 3-D characteristics of HSI.

2) A convolution projection unit and a convolution multihead self-attention mechanism with a residual attention connection are introduced in the newly designed residual global feature extraction module. This module is designed to capture long-range dependencies between spectral bands and can fully utilize both shallow and deep spatial-spectral information in order to increase the classification accuracy of the network. Unlike previous transformer-based methods, our approach uses a convolutional projection unit in place of the linear layer, which reduces its computational complexity. Furthermore, to ensure that the network does not overlook the shallow spatial-spectral information, a residual attention connection is added and incorporated into the convolution multihead self-attention mechanism.

3) An LGDRNet is proposed, consisting of a multiscale 3-D conv module, a dynamic local feature extraction module, a residual global feature extraction module, and a feature fusion module. This network can improve classification accuracy by fully utilizing local–global spatial and spectral features from limited HSI samples.

SECTION II.

Related Work

A. Transformer Approaches

An increasing number of studies are employing the transformer for HSI classification tasks due to its exceptional performance [43], [44], [45]. Mei et al. [46] proposed a group-aware hierarchical transformer architecture that introduces a new grouped pixel embedding module, allowing the network to extract global and local spatial-spectral contexts. Huang et al. [47] introduced a masked transformer architecture that comprises a contrasted transformer and two supervised transformers, allowing the network to obtain more discriminative spatial-spectral data. Wu et al. [44] proposed a token-enhanced transformer that uses a multiscale CNN to generate multiscale spatial-spectral information as token embeddings, and connects these token embeddings with a novel hash-based spatially enhanced positional embedding before feeding them into the transformer for long-range spatial-spectral information extraction. Shi et al. [48] proposed an attention transformer for attention head interaction that first employs a spatial local bias module capable of extracting local spatial-spectral information, followed by a diagonal mask multiscale dual attention module that allows the network to effectively utilize the information acquired by different attention heads. This approach can extract spatial-spectral features at various scales using feature vectors. Although the aforementioned transformer-based techniques have shown remarkable effectiveness in modeling long-range relationships, their capacity for obtaining local spatial information remains limited, restricting the network from effectively using HSI data.

B. CNN and Transformer Hybrid Approaches

Local and global spatial-spectral information obtained from HSI can be captured by CNN and Transformer hybrid approaches. Xu et al. [49] constructed a cross-attention fusion network that concentrates on long-range information using a Gaussian transformer branch and extracts local spatial features using a CNN branch. Zhong et al. [50] proposed a dual-branch convolution block and a multigranularity window shift transformer block to efficiently extract multiscale local and global spatial-spectral information. Yu et al. [51] constructed a pair of interactors, Conv2Trans and Trans2Conv, which combine CNN and ViT techniques to create a network that concentrates on local and global features. Nevertheless, the aforementioned techniques link only the CNN and the transformer, and do not make full use of their respective benefits.

In order to address the above issues, Zhao et al. [52] devised a representation encoder architecture capable of avoiding the network's exclusive focus on local or global feature extraction. In addition, Zhang et al. [41] proposed a cross fusion network which efficiently integrates the local spatial-spectral features obtained by 3-D convolution with the global spatial-spectral features obtained by the transformer. Although the aforementioned techniques have shown promising results, several issues remain. For example, these hybrid approaches employ conventional convolution for feature extraction, which results in the network paying insufficient attention to valuable pixels in different samples of HSI. Furthermore, the transformer branch in the hybrid models contains many parameters and suffers from the problem that shallow attention features, which are useful for later classification, tend to disappear.

In this work, we constructed a residual global feature extraction module and a dynamic local feature extraction module. The residual global feature extraction module employs convolutional projection to reduce the network complexity. Furthermore, it employs residual attention connection, allowing the network to make effective use of shallow attention features. The dynamic local feature extraction module utilizes 3-D dynamic convolution, which gives the network a more efficient receptive field and enhances its focus on valuable pixels in diverse HSI samples.

C. Dynamic Convolution

Dynamic convolution can adapt to different sample inputs, generating convolutional kernels with varying weights, thereby enhancing the ability of the CNN model to discern valuable information [53], [54], [55], [56]. Dynamic convolution is implemented using the dynamic weights principle. Dynamic weights are a significant advantage of self-attention, enabling a model to dynamically capture information based on its input. Han et al. [56] confirmed that dynamic convolution has fewer parameters than the shifted window attention module in a Swin Transformer and can achieve better performance. He et al. [57] proposed a dynamic multiscale network that extracts multiscale contextual features. Su et al. [58] proposed a dynamic group convolution network capable of adaptively extracting rich information and automatically choosing the most significant input channel in the input feature map. Li et al. [59] introduced a full-dimensional dynamic convolution allowing convolution kernels to process input features in parallel with a multidimensional attention. Li [60] constructed a dynamic group convolution network, allowing CNN to capture richer spatial-spectral features while avoiding the problem of substantial duplicated information between convolution kernels. Ding et al. [61] proposed a dynamic convolution method for structural reparameterization that allows the network to adaptively capture contextual spatial-spectral information with more representational capacity. Lou et al. [62] proposed the IDConv module, which prevents the static character of convolution and enables the network to better aggregate local information. Although the aforementioned dynamic convolution approaches performed well in HSI, they neglected to account for HSI's inherent three-dimensionality (3-D). The structure of HSI may be disrupted by the use of dynamic 2-D convolutions for feature extraction, which might negatively affect the capacity of the network to identify local features in HSI. Accordingly, we intend to establish a 3-D dynamic convolution approach for HSI classification tasks that may dynamically capture local spatial-spectral information without disrupting the 3-D structure of HSI, hence enhancing the receptive field of the network.

SECTION III.

Proposed Method

The overall framework of LGDRNet is shown in Fig. 1, including the multiscale 3-D conv, dynamic local feature extraction, residual global feature extraction, and feature fusion modules.

Fig. 1. Overall framework of the proposed LGDRNet for the HSI classification.

A. Data Preprocessing

The HSI data cube is used as the input $I \in \mathbb{R}^{H \times W \times C}$, as shown in Fig. 1, where $H \times W$ represents the spatial size and $C$ represents the spectral dimension. The HSI data are first processed using PCA, which maintains the spatial size while reducing the spectral dimension from $C$ to $B$ ($B < C$). Then, $L$ 3-D cube samples are extracted from the HSI data, and each 3-D cube sample is represented as $X \in \mathbb{R}^{U \times U \times B}$, where $U$ denotes the spatial size of the cubes and $L$ ($L = H \times W$) denotes the number of HSI samples. The 3-D cube sample set is then divided into training, validation, and testing sets in a specific proportion.
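As a concrete illustration, the following is a minimal preprocessing sketch assuming the conventional PCA-then-patch pipeline described above; the use of scikit-learn's PCA, reflect padding, and the helper names are our own choices and are not taken from the paper's code.

import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(hsi, n_components):
    """hsi: (H, W, C) array -> (H, W, B) array with B = n_components."""
    H, W, C = hsi.shape
    flat = hsi.reshape(-1, C)
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(H, W, n_components)

def extract_cubes(hsi_b, patch_size):
    """Pad the image and cut one U x U x B cube per pixel (L = H * W cubes)."""
    H, W, B = hsi_b.shape
    m = patch_size // 2
    padded = np.pad(hsi_b, ((m, m), (m, m), (0, 0)), mode="reflect")
    cubes = np.empty((H * W, patch_size, patch_size, B), dtype=hsi_b.dtype)
    k = 0
    for i in range(H):
        for j in range(W):
            cubes[k] = padded[i:i + patch_size, j:j + patch_size, :]
            k += 1
    return cubes

# Example with B = 15 PCA bands and a 15 x 15 spatial patch (the values used in Section IV-B):
# hsi = ...  # (H, W, C) array loaded from a dataset
# cubes = extract_cubes(pca_reduce(hsi, 15), 15)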

B. Multiscale 3-D Conv Module

As shown in Fig. 2, a multiscale 3-D conv module is utilized to extract the multiscale spectral features. First, 3-D-CNN layers with different kernel sizes are used to extract multiscale spectral features. In addition, the BatchNorm3-D (Norm) and ReLU activation functions are applied after each convolution layer. Lastly, feature maps are obtained by concatenating the spectral features of the convolutional layers at various scales along the channel dimension. The multiscale 3-D conv module provides richer spectral information for the model by extracting multiscale spectral features using various receptive fields.

Fig. 2. Structure of multiscale 3-D conv module.
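To make the module just described concrete, the sketch below implements parallel 3-D convolutional branches with different spectral kernel sizes, each followed by BatchNorm3-D and ReLU, and concatenates them along the channel dimension. The specific kernel sizes and channel counts are assumptions, since the text does not list them here.

import torch
import torch.nn as nn

class MultiScaleConv3d(nn.Module):
    def __init__(self, in_ch=1, out_ch=8, spectral_kernels=(3, 5, 7)):
        super().__init__()
        # One branch per spectral kernel size; padding keeps the cube size unchanged.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=(1, 1, k), padding=(0, 0, k // 2)),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(inplace=True),
            )
            for k in spectral_kernels
        ])

    def forward(self, x):                      # x: (N, 1, U, U, B)
        # Concatenate the multiscale spectral features along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# x = torch.randn(2, 1, 15, 15, 15)            # a small batch of 15 x 15 x 15 cubes
# y = MultiScaleConv3d()(x)                    # -> (2, 24, 15, 15, 15)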

C. Dynamic Local Feature Extraction Module

Most widely used dynamic convolution methods, such as DyConv [55] and IDConv [62], employ 2-D dynamic convolution. However, 2-D dynamic convolution may break the inherent structure of the HSI during the feature extraction stage, resulting in the network capturing insufficient local spatial information. In order to fully extract comprehensive local spatial-spectral information, a dynamic local feature extraction module is constructed, using 3-D dynamic convolution, which offers the network a more effective receptive field for responding to HSI input variations without disrupting the 3-D structure of HSI.

Fig. 3 shows an input feature $X_{3\text{-}D} \in \mathbb{R}^{R \times U \times U \times B}$, where $U \times U$ represents the spatial size and $R$ represents the number of feature maps. The size is compressed to $D^{3}$ using 3-D adaptive average pooling to produce the attention map $\hat{T} \in \mathbb{R}^{R \times D \times D \times D}$; the adaptive pooling aggregates important information from various samples and captures a more representative spatial and spectral context. Subsequently, the attention map is fed into two consecutive $1 \times 1 \times 1$ 3-D convolutional layers to obtain $T^{\prime} \in \mathbb{R}^{(N \times R) \times D \times D \times D}$, where $N$ denotes the number of attention groups. Then, the attention weights $T \in \mathbb{R}^{N \times R \times D \times D \times D}$ are obtained by reshaping $T^{\prime}$ into $\mathbb{R}^{N \times R \times D \times D \times D}$ and applying the Softmax function. Finally, a set of learnable parameters $M \in \mathbb{R}^{N \times R \times D \times D \times D}$ is elementwise multiplied with $T$ and summed over $N$ to produce a deep convolutional kernel $F \in \mathbb{R}^{R \times D \times D \times D}$ that can sense bias and changes in the input. The dynamic 3-D spatial-spectral block can be expressed as follows:
\begin{align*}
\hat{T} &= \text{AdaptivePool3d}(X_{3\text{-}D}) \\
T^{\prime} &= \text{Conv3d}_{1\times 1\times 1}^{\frac{R}{r}\to (N\times R)}\left(\text{Conv3d}_{1\times 1\times 1}^{R\to \frac{R}{r}}(\hat{T})\right) \tag{1}\\
T &= \text{Softmax}\left(\text{Reshape}\left(T^{\prime}\right)\right) \tag{2}\\
F &= \sum\nolimits_{i=0}^{N-1} M_{i} T_{i} \tag{3}
\end{align*}
where $r$ represents the channel squeeze-and-expansion ratio of the two consecutive $1 \times 1 \times 1$ 3-D convolutional layers. Different inputs produce different attention maps $T^{\prime}$, resulting in different attention weights $T$ and a different convolutional kernel $F$. Therefore, our dynamic local feature extraction module can perform dynamic local feature encoding to extract local spatial-spectral features.

Fig. 3. Structure of dynamic 3-D spatial-spectral block.
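The following PyTorch sketch follows (1)–(3): an input-dependent attention map weights $N$ learnable kernel groups $M$, and their weighted sum yields a per-sample kernel $F$. Taking the Softmax over the group axis and applying $F$ as a depthwise (grouped) convolution are our assumptions where the text leaves the details open; names and default values are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Dynamic3DConv(nn.Module):
    def __init__(self, channels_R, kernel_D=3, groups_N=4, ratio_r=4):
        super().__init__()
        self.R, self.D, self.N = channels_R, kernel_D, groups_N
        self.pool = nn.AdaptiveAvgPool3d(kernel_D)                   # -> (R, D, D, D), Eq. (1)
        self.fc = nn.Sequential(                                     # two 1x1x1 Conv3d layers, ratio r
            nn.Conv3d(channels_R, channels_R // ratio_r, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels_R // ratio_r, groups_N * channels_R, 1),
        )
        # Learnable kernel groups M of shape (N, R, D, D, D).
        self.M = nn.Parameter(torch.randn(groups_N, channels_R, kernel_D, kernel_D, kernel_D))

    def forward(self, x):                       # x: (batch, R, U, U, B)
        b = x.size(0)
        t = self.fc(self.pool(x))               # (batch, N*R, D, D, D)
        t = t.view(b, self.N, self.R, self.D, self.D, self.D)
        t = t.softmax(dim=1)                    # attention weights T over the N groups, Eq. (2)
        kernels = (t * self.M.unsqueeze(0)).sum(dim=1)   # per-sample kernel F: (batch, R, D, D, D), Eq. (3)
        # Apply each sample's kernel depthwise via the grouped-convolution trick over the batch.
        out = F.conv3d(x.reshape(1, b * self.R, *x.shape[2:]),
                       kernels.reshape(b * self.R, 1, self.D, self.D, self.D),
                       padding=self.D // 2, groups=b * self.R)
        return out.reshape(b, self.R, *x.shape[2:])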

D. Residual Global Feature Extraction Module

Although transformer-based approaches effectively capture long-range dependencies in HSI, they have high computation costs. In addition, as the depth of the attention network increases, fine-grained features in the shallow attention layers gradually disappear. Inspired by ConvTE [63] and Revit [64], we designed a residual global feature extraction module, employing convolutional operations to reduce computational complexity and residual attention connections to preserve shallow detail features. This module more efficiently extracts global spatial-spectral features and recognizes long-term dependencies in HSI, while also featuring low computational complexity.

The residual global feature extraction module consists of two residual attention transformer blocks. As shown in Fig. 4, the residual attention transformer block is composed of 3-D convolution (3-D Conv), convolutional projection, convolutional multihead self-attention, convolutional MLP (CMLP), and residual attention connection modules.

Fig. 4. Structure of residual attention transformer block.

In Fig. 4, the channel information of the multiscale features is first integrated using a 3-D Conv with a channel number of one and a convolution kernel size of $3 \times 3 \times 3$. Subsequently, a convolutional projection is produced from a 3-D Conv with a channel number of one and a convolution kernel size of $3 \times 3 \times 1$. This means that each spectral band of the feature is treated as a token. The matrices $Q_{l}^{c}, K_{l}^{c}, V_{l}^{c} \in \mathbb{R}^{1 \times U \times U \times B}$ (where $l$ denotes the $l$th residual attention transformer block and $c$ denotes the $c$th attention head within the block) are generated by the convolutional projection, which avoids flattening operations that may destroy spatial information. $Q_{l}^{c}$, $K_{l}^{c}$, and $V_{l}^{c}$ are then reshaped into 1-D feature vectors in $\mathbb{R}^{\alpha \times \beta}$ (where $\alpha = B$ and $\beta = 1 \times U \times U$). The query, key, and value of the convolutional multihead self-attention are calculated as follows:
\begin{align*}
Q_{l} &= \text{Concat}(Q_{l}^{1}, \ldots, Q_{l}^{c}, \ldots) \tag{4}\\
K_{l} &= \text{Concat}(K_{l}^{1}, \ldots, K_{l}^{c}, \ldots) \tag{5}\\
V_{l} &= \text{Concat}(V_{l}^{1}, \ldots, V_{l}^{c}, \ldots). \tag{6}
\end{align*}


Convolutional multihead self-attention inherits the ability of the transformer to capture dependencies between long-range information, allowing global spectral features in HSI to be more accurately captured. Using convolutional operations on the $K_{l}$ and $V_{l}$ matrices to reduce feature size can mitigate data redundancy and reduce computational complexity.

As shown in Fig. 4, the residual attention connections used in the convolutional multihead self-attention allow attention to propagate continuously from shallow to deep layers, expanding the flow of specific information between adjacent residual attention transformer blocks and enhancing feature diversity. For the first residual attention transformer block, the convolutional projection generates the query, key, and value ($Q_{l}$, $K_{l}$, $V_{l}$) spaces, and the attention scores $S_{l} = \frac{Q_{l} K_{l}^{T}}{\sqrt{d}}$ are calculated, where $d$ is the size of the key. Subsequently, the Softmax function is used to map $S_{l}$ to the range (0, 1). Finally, the output $Z_{l}$ is obtained by multiplying $V_{l}$ with $\text{Softmax}(S_{l})$. For the residual attention transformer blocks beyond the first, the information from $S_{l-1}$ must also be propagated after the aforementioned operations. As a result, these blocks are required to consider both the previously extracted relationships and the new relationships. The residual attention transformer block can be expressed as follows:
\begin{align*}
Z_{l} &= \text{Softmax}\left(S_{l}\right) \times V_{l} \tag{7}\\
S_{l} &= \begin{cases} \dfrac{Q_{l} K_{l}^{T}}{\sqrt{d}}, & \text{if } l = 0 \\ \alpha\left(\dfrac{Q_{l} K_{l}^{T}}{\sqrt{d}}\right) + \left(1 - \alpha\right) S_{l-1}, & \text{otherwise} \end{cases} \tag{8}
\end{align*}

where $l$ represents the $l$th residual attention transformer block, and $\alpha$ ($\alpha \in [0, 1]$) represents a learnable variable that allows the network to autonomously decide how much attention to propagate between the residual attention transformer blocks.

Finally, features of the residual attention transformer block are output by CMLP, which is a 3-D-CNN with a channel number of one and a convolution kernel size of $3 \times 3 \times 1$.

Residual attention transformer blocks enable the network to fully extract global spectral features. Relative to linear layers, convolutional projection not only reduces computational complexity but also mitigates the shortcomings of linear projection in capturing local spatial information. This allows residual attention transformer blocks to extract local spatial features.
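A minimal sketch of the residual attention connection of (7) and (8) is given below, assuming $Q$, $K$, and $V$ have already been produced by the convolutional projection and reshaped to token form; clamping the learnable $\alpha$ to $[0, 1]$ is our assumption about how the constraint is enforced, and the tensor layout is illustrative.

import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))    # learnable blend weight, Eq. (8)

    def forward(self, Q, K, V, prev_scores=None):
        # Q, K, V: (heads, tokens, dim) 1-D token vectors from the convolutional projection.
        d = K.size(-1)
        scores = Q @ K.transpose(-2, -1) / d ** 0.5      # S_l for the current block
        if prev_scores is not None:                      # blocks after the first: blend with S_{l-1}
            a = self.alpha.clamp(0.0, 1.0)
            scores = a * scores + (1.0 - a) * prev_scores
        out = torch.softmax(scores, dim=-1) @ V          # Z_l, Eq. (7)
        return out, scores                               # scores are passed to the next block

Chaining the blocks then amounts to feeding the returned scores of block l into block l + 1 as prev_scores, which is how the attention information accumulates from shallow to deep layers.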

E. Feature Fusion Module

In order to generate a more expressive feature representation, the feature fusion module is utilized to integrate local and global spatial-spectral features. It applies elementwise addition to the local spatial-spectral features extracted by the dynamic local feature extraction module and the global spatial-spectral features obtained by the residual global feature extraction module. The feature fusion module combines local and global spatial-spectral features to produce a more complete feature representation, reducing reliance on a single feature and making the model more robust in complex circumstances. It also integrates the benefits of the dynamic 3-D spatial-spectral block with the residual attention transformer block to emphasize important features, increasing the accuracy of subsequent classification results.
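A minimal sketch of the fusion step is shown below; the flatten-and-linear classification head is an assumed placeholder, since the paper does not detail the classifier here.

import torch
import torch.nn as nn

class FeatureFusionHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)   # assumed classification head

    def forward(self, local_feat, global_feat):
        fused = local_feat + global_feat                      # elementwise fusion of local and global features
        return self.classifier(fused.flatten(start_dim=1))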

SECTION IV.

Experiments

We evaluate the classification performance of LGDRNet using three distinct datasets to confirm the efficacy of the model. The HanChuan (HC), HongHu (HH), and Salinas (SA) datasets were utilized in the evaluations.

A. Data Description

The HC dataset was obtained in Hanchuan City, Hubei Province, China. It features 16 categories, 274 spectral bands, $1217 \times 303$ pixels, and a spatial resolution of 0.109 m. The ground truth image and false-color image are shown in Fig. 5.

Fig. 5. HC dataset. (a) False-color map. (b) Ground-truth map.

The HH dataset was collected in HongHu City, Hubei Province, China. It includes 22 different categories and has $940 \times 475$ pixels, 270 spectral bands, and a spatial resolution of 0.043 m. The ground truth image and false-color image are shown in Fig. 6.

The SA dataset was captured over the Salinas Valley in California. It has 16 distinct classes, a spatial resolution of 3.7 m, 224 spectral bands, and $512 \times 217$ pixels. The ground truth image and false-color image are shown in Fig. 7.

Fig. 6. HH dataset. (a) False-color map. (b) Ground-truth map.

Fig. 7. SA dataset. (a) False-color map. (b) Ground-truth map.

Each dataset is separated into training, validation, and test sets. 0.1%, 0.1%, and 0.2% of the total samples in the HC, HH, and SA datasets are chosen at random to serve as training samples. The validation samples comprise the same percentage as the training samples, and the remaining samples are used for testing. Comprehensive details related to these three datasets are listed in Tables I–III.

TABLE I Training and Testing Samples in HC Dataset
TABLE II Training and Testing Samples in HH Dataset
TABLE III Training and Testing Samples in SA Dataset

B. Experimental Setting

In order to evaluate the performance of LGDRNet, it was compared with other DL methods, including CNN-based methods (DFFN [22] and RSSAN [65]), transformer-based methods (SpectralFormer [31], morphFormer [66], and FactoFormer [30]), and hybrid methods based on both the CNN and transformer (HybridSN [25], LGGNet [41], and GSC-ViT [67]). The average accuracy (AA), overall accuracy (OA), and Kappa coefficient (Kappa) are utilized to quantitatively evaluate how well the proposed method performs in comparison to other approaches. Ten runs are averaged to obtain our experimental results. The higher the value of each metric, the better the classification performance.
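For clarity, the snippet below computes OA, AA, and Kappa from a confusion matrix using their standard definitions; it is illustrative and not taken from the paper's implementation.

import numpy as np

def classification_metrics(conf):
    """conf[i, j]: number of class-i samples predicted as class j."""
    total = conf.sum()
    oa = np.trace(conf) / total                                 # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)                # per-class accuracy (recall)
    aa = per_class.mean()                                       # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)                              # Cohen's kappa coefficient
    return oa, aa, kappa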

The experiments for the proposed method were conducted entirely in the PyTorch environment, utilizing a CPU (Intel i7-13700HX), a GPU (NVIDIA GeForce RTX 4060, 8 GB), and 16 GB of RAM. The proposed LGDRNet was implemented using Python 3.10.11 and PyTorch 2.0.1.

A batch size of 512 was employed throughout the training, validation, and testing process. For the proposed LGDRNet, the patch sizes for the HC, HH, and SA datasets were set to $15 \times 15$, $15 \times 15$, and $13 \times 13$, respectively. The number of spectral bands after PCA dimensionality reduction is 15. Each dataset was trained for 500 epochs using the Adam optimizer with a learning rate of $1 \times 10^{-3}$.
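As a rough illustration of the training configuration just listed (optimizer, learning rate, epochs, and batch size), the sketch below wires these values into a plain PyTorch loop; the model and dataset objects are placeholders, not the paper's released code.

import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=500, lr=1e-3, batch_size=512, device="cuda"):
    # train_set yields (cube, label) pairs; model is assumed to be an LGDRNet instance.
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()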

C. Classification Results Analysis

1) Quantitative Analysis: To conduct the quantitative analysis, Tables IV–VI summarize the classification results obtained from the HC, HH, and SA datasets. As shown, the LGDRNet surpasses all other approaches, with the best outcomes displayed in bold.

TABLE IV Classification Results of Different Methods on the HC Dataset
TABLE V Classification Results of Different Methods on the HH Dataset
TABLE VI Classification Results of Different Methods on the SA Dataset

On the HC, HH, and SA datasets, the proposed LGDRNet obtained OAs of 89.24%, 92.01%, and 94.53%, respectively. LGDRNet achieved higher classification accuracy relative to the other approaches. These experimental results demonstrate that the proposed method is effective.

Table IV demonstrates that LGDRNet achieves 89.24% OA and an 87.39% Kappa on the HC dataset, which is the best result obtained. In comparison, LGGNet achieved an OA of 85.70% and a Kappa of 83.22%; compared with LGDRNet, its OA and Kappa scores are lower by 3.54% and 4.17%, respectively. Furthermore, the OA and Kappa scores of GSC-ViT are lower than those of LGDRNet. LGDRNet is more efficient because it effectively integrates the benefits of CNNs and transformers. Compared to LGGNet and GSC-ViT, we incorporated convolutional operations into the transformer encoder and employed dynamic 3-D-CNN. Our method not only reduces computational complexity but also enables the network to capture more comprehensive local and global spatial-spectral features from HSI. The proposed LGDRNet achieves the best performance on the HH dataset, as shown in Table V, with 92.01% OA and an 89.88% Kappa value. The RSSAN method featured the worst classification accuracy with an OA of only 72.91%, which may be because RSSAN focuses only on extracting local spatial information and ignores relationships with long-range information. The proposed LGDRNet effectively captures connections between long-range information using the residual global feature extraction module and effectively extracts local spatial features using the dynamic local feature extraction module, thereby showing superior results relative to existing models.

Experiments on the SA dataset, as shown in Table VI, demonstrate that the proposed LGDRNet achieved the best performance with an OA of 94.53% and a Kappa value of 93.91%. The morphFormer method's OA and Kappa values were only 91.48% and 90.50%, respectively, i.e., 3.05% and 3.41% lower than those of LGDRNet. This result is due to the flattening operation of morphFormer during the encoding process, which may have destroyed spatial information, leading to insufficient utilization of spatial information. In contrast, our LGDRNet uses convolutional operations for encoding, applying convolutional projection to fully utilize spatial information without destroying it. Although our method is not superior to GSC-ViT in terms of classification performance on certain categories, our OA and Kappa scores are higher than those of GSC-ViT, demonstrating that our method has strong generalization potential.

In summary, across the three datasets, the proposed LGDRNet achieved the highest and most stable performance across most classification categories, demonstrating the effectiveness and robustness of the proposed LGDRNet.

2) Visual Analysis: Figs. 8–10 provide visual representations of the classification results obtained from the HC, HH, and SA datasets.

Fig. 8. Classification maps of the HC dataset. (a) Ground-truth map. (b) DFFN. (c) HybridSN. (d) RSSAN. (e) SpectralFormer. (f) morphFormer. (g) LGGNet. (h) FactoFormer. (i) GSC-ViT. (j) Proposed.

Fig. 9. Classification maps of the HH dataset. (a) Ground-truth map. (b) DFFN. (c) HybridSN. (d) RSSAN. (e) SpectralFormer. (f) morphFormer. (g) LGGNet. (h) FactoFormer. (i) GSC-ViT. (j) Proposed.

Fig. 10. Classification maps of the SA dataset. (a) Ground-truth map. (b) DFFN. (c) HybridSN. (d) RSSAN. (e) SpectralFormer. (f) morphFormer. (g) LGGNet. (h) FactoFormer. (i) GSC-ViT. (j) Proposed.

Across the three datasets, LGDRNet demonstrated superior performance and achieved the best overall classification results. This is largely because LGDRNet more fully captures local and global information by utilizing the advantages of both CNN and transformer approaches. In contrast, the RSSAN method exhibited the poorest performance, producing classification maps with a significant amount of noise. RSSAN performs poorly in HSI classification because it is unable to capture dependence on long-range spectral information.

Fig. 8 shows that the classification performance of the SpectralFormer and GSC-ViT methods is similar to that of the LGDRNet method. However, some noise issues and a lack of stability are noted. The root cause of this performance is that SpectralFormer focuses solely on extracting long-range spectral information and lacks the ability to capture local spatial features, which results in noisy classification outcomes. Furthermore, the GSC-ViT method fails to efficiently integrate the CNN with the transformer, leading to insufficient extraction of local and global information.

Overall, the proposed LGDRNet achieves more accurate classification for almost all categories, with less noise and better performance than all other methods. This demonstrates the strong generalization ability of the proposed method and its effectiveness.

D. Ablation Experiments

To validate the efficacy of the various components included in LGDRNet, we conducted ablation studies on the combination of various components on the HC, HH, and SA datasets. The main components of LGDRNet include multiscale 3-D conv, dynamic local feature extraction, and residual global feature extraction modules. To confirm the efficacy of 3-D dynamic convolution, we also used 3-D regular convolution of the same size for model validation in place of the dynamic local feature extraction module. The 3-D regular convolution employs BatchNorm3-D and ReLU activation functions. OA, AA, and Kappa × 100 were used as evaluation indicators for ablation studies. Tables VII, VIII, and IX show the experimental results.

TABLE VII Impact of Various Modules in LGDRNet on the HC Dataset
TABLE VIII Impact of Various Modules in LGDRNet on the HH Dataset
TABLE IX Impact of Various Modules in LGDRNet on the SA Dataset

In detail, Cases 1, 3, and 4 evaluated individual components of LGDRNet, directly inputting HSI patches that had been subjected to PCA dimensionality reduction into the respective training components. The experimental results indicated that, relative to Cases 1 and 3, Case 4 exhibited the worst performance across the three metrics on both the HC and HH datasets. These findings indicate that the residual global feature extraction module is inadequate for obtaining local spatial information from HSI, which restricts the representation of local spatial features in the HSI. Furthermore, the dynamic local feature extraction module did not perform as well on the SA dataset as the other two cases. These findings show that the dynamic local feature extraction module may not always be sufficient to obtain spectral information from HSI.

Furthermore, Cases 5, 7, and 9 each combined two individual LGDRNet components. On all three datasets, Case 7 outperformed the other two cases. These results indicate the necessity of multiscale spectral feature extraction in the model. The performance of Case 9 was inferior to that of Case 7 on the HC, HH, and SA datasets, suggesting that although the residual global feature extraction module utilizes convolutional operations, its ability to capture local spatial features is weak. Case 11 demonstrated the highest classification metrics across all three datasets. This means that the long-range dependencies of HSI are crucial for the subsequent classification results. Furthermore, the residual global feature extraction module enables the network to fully extract the global spatial-spectral information in HSI. Analysis of the experimental results shows that each component of the proposed method plays an indispensable role in the subsequent classification accuracy.

Finally, the experimental outcomes of 3-D regular convolution on the three distinct datasets are shown in Case 2. In comparison to Case 2, Case 3 exhibits superior classification metrics across all three datasets. This shows that the 3-D dynamic convolution in the dynamic local feature extraction module is capable of enhancing the network's attention to valuable pixels across a variety of samples. In Cases 6 and 8, 3-D regular convolution is used in place of the 3-D dynamic convolution of Cases 5 and 7, and in Case 10, 3-D regular convolution is used in place of LGDRNet's dynamic local feature extraction module. All cases with the dynamic local feature extraction module outperform those with 3-D regular convolution on the three datasets. The experimental results indicate that 3-D dynamic convolution outperforms 3-D regular convolution in local feature extraction.

E. Performance of Different Dynamic Convolution Methods

To further compare the proposed 3-D dynamic convolution with other dynamic convolution techniques, we evaluate each dynamic convolution method on the HC, HH, and SA datasets using OA, AA, and Kappa × 100, as indicated in Table X. The experimental results show that, compared to the DCMconv [57], DGCconv [58], ODconv [59], and IDconv [62] approaches, the proposed dynamic local feature extraction module produces better classification. It is clear that the dynamic local feature extraction module does not disrupt the 3-D structure of HSI and offers more efficient perception of valuable pixels, which improves its ability to adaptively extract useful local spatial-spectral information.

TABLE X Comparison of Dynamic Convolutions

F. Performance of Different Patch Sizes

On the three distinct datasets, we experimented to study how varying patch sizes affect the classification accuracy of LGDRNet. Table XI shows the results of the OA metric. In particular, we examined how various patch sizes (from 9 × 9 to 17 × 17) influence classification. To ensure the stability and reliability of the results, we conducted ten independent experiments and recorded the OA metric.

TABLE XI OA (%) Results for the LGDRNet Method Across Varied Patch Sizes

Our findings showed that the ideal patch size for the examined model differed according to the dataset used. This variation might be due to differences in the characteristics, noise levels, and land cover distribution of the datasets. When the patch size is 15 × 15, LGDRNet exhibits the best classification effect on the HC and HH datasets, with OAs of 89.24% and 92.01%, respectively. When the patch size is 13 × 13, the model performs best on the SA dataset, with an OA of 94.53%. These findings suggest that selecting an appropriate patch size for a specific context is crucial.

G. Performance of Different Train Sample Ratios

To demonstrate the efficacy of LGDRNet, we present the classification performance of these nine techniques at various training sample ratios: 0.05%, 0.075%, 0.1%, and 0.125% for the HH and HC datasets, and 0.1%, 0.2%, 0.3%, and 0.4% for the SA dataset. Fig. 11 displays the experimental outcomes of these nine techniques.

Fig. 11. OA of different methods with different training sample percentages. (a) HC dataset. (b) HH dataset. (c) SA dataset.

The classification performance of LGDRNet consistently outperforms that of the other techniques on the HH, HC, and SA datasets, and its classification accuracy grows as the number of training samples increases. This demonstrates that the proposed method is more stable and has greater generalizability. Overall, our method demonstrates high classification performance even with limited samples, further confirming its stability and robust generalizability.

H. Complexity Analysis

Floating-point operations (FLOPs), number of parameters (Params), training time, and test time are used as the evaluation metrics for each method on the HC dataset, as shown in Table XII. The results obtained show that the proposed LGDRNet has achieved remarkable results in classification performance. However, LGDRNet still has room for further improvement in Params and FLOPs. In a broader context, relative to CNN-based methods, most transformer-based methods typically require more computation time to capture complex feature information. For example, the training times for RSSAN and HybridSN are 13 s and 12 s, respectively. This difference in time is primarily due to the fact that transformer-based methods require a large number of attention calculations to obtain robust feature representations.

TABLE XII Complexity Analysis Results Between LGDRNet and the Comparative Methods

In terms of Params and FLOPs, the proposed LGDRNet method ranks second only to the best methods. This is attributed to LGDRNet's powerful information extraction capability, which negates the need to add more layers to fully capture the information concealed in HSI. Furthermore, by introducing convolutional operations within the transformer, the parameter count is reduced, which further decreases the size of LGDRNet.

SECTION V.

Discussion

Across three HSI datasets, the proposed LGDRNet model performs exceptionally well relative to other state-of-the-art techniques. The reason mainly lies in the presence of the dynamic local feature extraction module. The network can effectively extract local spatial-spectral features in HSI using this module, which identifies valuable pixels in various samples without compromising the underlying structure of the HSI. Furthermore, a residual global feature extraction module is employed, allowing the extraction of richer global spatial-spectral features without neglecting crucial shallow attention information, while maintaining a lightweight network. Our technique exhibited improvements in OA of 3.54%, 3.48%, and 1.68%, respectively, on the HC, HH, and SA datasets when compared to existing HSI classification networks.

Furthermore, the sample class distributions in HSI datasets are often unbalanced, resulting in a lack of training examples for some classes. The LGDRNet model is better able to use both local and global spatial-spectral information in HSI. This ensures that the LGDRNet model can function successfully even in cases when certain classes have very few training samples. For example, the OA of our method is 8.49% higher than other advanced approaches for the Water Spinach class in the HC dataset, which has only one training sample. Likewise, relative to previous advanced approaches, our approach increased the OA for the Celtuce class in the HH dataset by 15.54%. We utilize the advantages and unique characteristics of both the CNN and the transformer when building a hybrid network, giving the network superior learning capabilities. However, our LGDRNet model has certain limitations. For example, it is not effective for partially obstructed ground objects, e.g., the film-covered lettuce in the HH dataset. It is, therefore, necessary to undertake additional research and investigation to make the model more effective.

SECTION VI.

Conclusion

In this article, we propose the LGDRNet method for HSI classification. This method effectively combines CNN and transformer approaches to extract multiscale spectral features from HSI while retaining the ability to capture spatial features, providing richer spectral information for subsequent modules, and also considering the long-range dependence of spectral and spatial features. In addition, we designed a dynamic local feature extraction module and a residual global feature extraction module. The dynamic local feature extraction module dynamically extracts local spatial features based on different HSI samples, increasing the focus on valuable pixels within 3-D samples. The residual global feature extraction module effectively utilizes attention information from multiple attention layers, preventing the disappearance of shallow spatial-spectral features as the number of attention layers increases. We demonstrate the effectiveness of the LGDRNet method by conducting experiments on three public datasets.

However, despite its high classification accuracy, the parameter count and FLOPs of the model still leave room for improvement. Future work should concentrate on CNN and transformer hybrid approaches that are lighter and faster, and able to extract more comprehensive local and global spatial-spectral information in HSI. Furthermore, it would be useful to build more efficient networks to lessen the dependency on HSI training samples.
