Dual-Branch Spectral–Spatial Adversarial Representation Learning for Hyperspectral Image Classification With Few Labeled Samples

Recently, deep learning methods, particularly convolutional neural networks, have been extensively employed for extracting spectral–spatial features in hyperspectral image (HSI) classification tasks, yielding promising results. Conventional methods often use small image patches as input and combine spectral and spatial features with fixed strategies. However, the equal treatment of all pixels within heterogeneous patches can negatively impact feature extraction performance. In this article, we propose a semisupervised dual-branch spectral–spatial adversarial representation learning (SSARL) method for HSI classification. SSARL adaptively assigns attention weights to different pixels and adds a spectral constraint to spatial features. Our approach consists of three main components: 1) a dual-branch framework designed to independently extract spectral and spatial information from pixel and patch samples; 2) a class consistency loss that adaptively combines spectral and spatial classification results, mitigating the adverse effects of heterogeneous patches and enabling appropriate feature selection for various situations; and 3) an adversarial representation module and conditional entropy added to both branches, reducing the deep learning model's reliance on labeled sample size. Experimental results demonstrate that SSARL outperforms competitive methods on small-sized (0.3%–5%) labeled samples and exhibits superior performance for boundary test pixels.


I. INTRODUCTION
HYPERSPECTRAL imaging is a type of remote sensing technology that captures abundant spectral and spatial information. Unlike conventional RGB images, hyperspectral images (HSIs) are 3-D images, which enable a wide range of applications [1], [2], including modern agriculture [3], aviation industry [4], security [5], and biomedicine [6]. HSI classification, an essential process in remote sensing, discriminates ground objects with unique spectral characteristics. Although HSIs contain a large number of spectral bands that provide rich information, they also introduce redundancy and noise [7], [8].
Consequently, researchers have focused on effective feature extraction methods. Traditional supervised methods [9] typically transform high-dimensional data into low-dimensional features and design manual features based on prior knowledge [10]. However, features obtained through traditional methods rely heavily on expert experience, which often results in low classification accuracy for practical applications.
Recently, deep-learning-based methods have demonstrated improved performance in HSI classification due to their powerful feature extraction capabilities. They automatically extract deep and discriminative features, overcoming the limitations of traditional methods. Examples include stacked autoencoders (SAEs) [11], deep belief networks (DBNs) [12], [13], [14], convolutional neural networks (CNNs) [15], and generative adversarial networks (GANs) [16], [17], [18], [19]. The aforementioned methods primarily extract spectral features from individual hyperspectral pixels. In addition, numerous studies have shown that incorporating spatial information into classifiers can effectively enhance performance [20]. Spatial features address two key challenges: 1) high-dimensional spectral features not only contain abundant information but also introduce redundancy and noise, and operating on all the pixels within an image patch when extracting features effectively reduces noise and errors; and 2) the same land cover types often exhibit distinct spatial structures, while within-class spectral differences can lead to spectral variations, and spatial features exploit the correlation between neighboring pixels within a patch, thus mitigating the impact of spectral changes [21]. The core concept of spatial features involves fusing features from all the pixels within an image patch and treating them as central pixel features. At present, spatial features are obtained by utilizing patch samples and designing spatially structured models. Yang et al. [22] designed a two-channel CNN structure, with one channel for spectral feature extraction and another for spatial feature extraction. The two types of features are concatenated and sent to a fully connected layer. Li et al. [23] used the 3-D CNN to directly extract spectral-spatial features from HSI patch samples. Similarly, the discriminator of the 3-D GAN [24] can classify samples and determine their authenticity based on extracted spectral-spatial features. Fang et al. [25] proposed a new multiclass GAN that combines spectral and spatial features. To reduce model complexity and enhance spatial feature abstraction, HybridSN, proposed by Roy et al. [26], consists of 3-D CNN and 2-D CNN layers. An image patch sample passes through three 3-D CNN layers and one 2-D CNN layer successively to obtain a spectral-spatial joint feature. Moreover, models based on attention mechanisms [27], [28], [29] can extract global features of images. SSFTT [30] first inputs 3-D image patches to a CNN, and the output feature maps are divided into semantic patches. These patches are then input to a transformer-based encoder. Deep learning HSI classifiers that utilize spectral-spatial features and patch samples have achieved impressive results. However, deep-learning-based methods typically require a large number of labeled training samples to optimize the abundant parameters of deep models and avoid overfitting. In addition, spatial features have inherent shortcomings, which are analyzed in detail below.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For HSI spectral-spatial classification with a small number of labeled samples, semisupervised learning is considered a promising approach. Semisupervised learning aims to extract information from a large number of unlabeled samples. Sun et al. [31] proposed a semisupervised algorithm that combines clustering and manifold techniques. Seydgar et al. [32] designed a semisupervised framework capable of generating reliable fake labels, which are effective for various deep learning models. The semisupervised method based on the folded spectrum GAN [33] folds the original spectral vector into a 2-D square as the input of the GAN. Similarly, HSGAN [34] extracts spectral features using a custom 1-D GAN and employs a novel CNN framework for classification. A specialized voting strategy is utilized to enhance performance. DAE-GCN [35] introduces a spectral-spatial graph to train a graph convolutional network using a semisupervised strategy. Tang et al. [36] proposed a method for extracting multiscale spatial-spectral features based on a ladder structure. The complexity of hyperspectral data distribution, however, still limits the performance of semisupervised models.
In addition to the issues mentioned above, the use of image patch samples and spatial-spectral features presents challenges related to vague boundaries and misclassification. First, some methods assume that all the pixels within an image patch contribute equally to the classification of the central pixel. However, realistic image patches can be divided into homogeneous patches, which consist of pixels of a single class, and heterogeneous patches, which contain pixels of multiple classes. Spatial features extracted from homogeneous patches can enhance the classification performance by introducing spatial relationships and suppressing noise. In contrast, spatial features extracted from heterogeneous patches can be viewed as the fusion of pixels from different classes. Consequently, the spatial features extracted from heterogeneous patches may not accurately represent central pixels, limiting their classification performance [37]. The influence of spatial features can be mitigated by properly emphasizing spectral features, which focus on the spectral vector itself. Second, the existing research on spectral-spatial features primarily relies on fixed strategies for fusing the two types of features [21], such as concatenating feature vectors [38], [39], [40]. Given the characteristics of heterogeneous patches, these fixed strategies may reduce performance, especially because patches at boundaries are typically heterogeneous. Adaptively combining the two features, for example by adding spectral constraints to guide the assignment of attention weights, could alleviate this issue. Therefore, effectively utilizing both types of samples and features remains a critical challenge.
To extract deep adaptive spectral-spatial features from various image patches and address sample scarcity and imbalance, we propose a semisupervised spectral-spatial-dependent learning framework that combines the GAN and the global joint attention mechanism, named dual-branch spectral-spatial adversarial representation learning (SSARL). An adversarial representation module is incorporated to handle limited labeled samples, while the dual-branch structure and class consistency loss offer a novel strategy for adaptively combining spectral and spatial features. The characteristics of pixel and patch samples are also considered. The contributions of this article are summarized as follows.
1) We propose a learnable dual-branch framework that extracts all the useful spectral and spatial features by processing pixel and patch samples independently in parallel.
2) We introduce a loss function called class consistency loss, which replaces the existing feature fusion strategies. This function adds spectral constraints and adjusts the attention weights of spectral and spatial branches adaptively, allowing the learned framework to perform well on heterogeneous patches.
3) We apply an adversarial representation module for spectral and spatial feature extraction. Through the adversarial process, robust features are learned from limited labeled samples.
The rest of this article is organized as follows. Section II presents the details of the proposed SSARL. Section III showcases the results and analysis of our experiments. Finally, Section IV concludes this article.

II. METHODOLOGY
In this section, first, we briefly introduce the proposed SSARL. Second, the adversarial representation module is illustrated. Then, the proposed class consistency loss is given. Finally, the details of complete spectral and spatial branches are introduced.

A. Overview of the Proposed Model
An HSI dataset can be represented as $H \in \mathbb{R}^{h \times w \times b}$, where $h$, $w$, and $b$ represent the spatial height, the spatial width, and the number of spectral bands, respectively. The dataset contains $N$ labeled pixel samples. $x_{spe} \in \mathbb{R}^{1 \times 1 \times d}$ represents the spectral sample, and $y_{spe} \in \mathbb{R}^{1 \times 1 \times c}$ represents the corresponding one-hot label, where $c$ denotes the number of classes. $x_{spa}$ represents the patch sample with a size of $m \times m \times d$, where $m$ represents the height and width. $y_{spa}$ indicates the one-hot label of the corresponding central pixel in the patch. The collection of the labeled pixel samples and labels is represented as $L_{spe}$, and the unlabeled collection is $U_{spe}$. Sample pairs $(x_{spe}, x_{spa}, y)$ and $(x_{spe}, x_{spa})$ are inputs to the model. The SSARL framework is illustrated in Fig. 1. The proposed model contains a spectral branch (the upper half of Fig. 1) and a spatial branch (the lower half of Fig. 1). Each branch consists of an encoder (E) based on the CNN and a classifier (C), including a fully connected layer and Softmax. Instead of learning hierarchical spatial-spectral features or concatenating spectral and spatial features [21], the spectral branch extracts the spectral feature from labeled and unlabeled pixel samples, and the spatial branch extracts the spatial feature from labeled and unlabeled patch samples. To extract robust features from limited labeled samples, we apply adversarial representation modules (the middle part of Fig. 1) based on the GAN to both branches. The proposed class consistency loss unifies the results obtained by the two classifiers and the two discriminators.
The dual-branch structure makes full use of spectral and spatial features from pixel and patch samples. Inspired by the GAN, we insert an adversarial representation module. This module contains a generator (G) and a discriminator (D). Through the adversarial process, the module uses limited labeled samples to enlarge the sample space and increases sample diversity, thus preventing overfitting. Finally, the proposed class consistency loss adds additional constraints to two discriminators and classifiers. The proposed loss function can make use of features adaptively rather than adopting fixed spectral-spatial combination strategies. Meanwhile, dual-branch structure and class consistency loss can reduce the negative impact of spatial features extracted from heterogeneous patches (e.g., a boundary patch is composed of multiclass pixels, and the representation ability of spatial feature is weakened, so more attention should be paid to spectral feature from central pixel). The aforementioned parts will be illustrated in the following sections.

B. Adversarial Representation Module
GANs have been widely used for data augmentation for natural image processing in computer vision [41]. They can maintain an identical distribution as original samples and increase the diversity [42], [43], [44], [45]. The adversarial representation module utilizes the adversarial process to enhance the extracted semantic features. Instead of reconstructing samples at pixel level through the root-mean-square error (e.g., SAE), the adversarial representation module can be seen as a sample construction through variable constraint based on the CNN and the GAN. The proposed adversarial representation module can be applied to pixel-spectral feature and patch-spatial feature extraction. During the adversarial process, multiple mappings from correct features to potential samples are learned. The encoder is also guided to map sample space to semantic feature space at the class level.
To satisfy Lipschitz continuity, we adopt spectral normalization (SN) [46] for the discriminator. Lipschitz continuity requires the rate of change of the function to be bounded by a constant $K$, formulated as follows:
$$\frac{\| D(x_1) - D(x_2) \|_2}{\| x_1 - x_2 \|_2} \le K$$
where $D$ is the discriminator function, $x_1$ and $x_2$ are two inputs that are sufficiently close, and $\| \cdot \|_2$ represents the Euclidean norm. SN can be formulated as follows:
$$\sigma(W) = \sup_{x \ne 0} \frac{\| W x \|_2}{\| x \|_2}, \qquad W_{SN} = \frac{W}{\sigma(W)}$$
where $W$ is the parameter matrix of a layer and $\sup$ represents the upper bound. When $x$ is considered within a sufficiently small neighborhood, the nonlinear discriminator function $D$ can be regarded as a linear function. Under this condition, Zhang et al. [47] proved that applying SN to multiple layers satisfies Lipschitz continuity. SN normalizes the parameter matrix of every layer by dividing it by its maximum singular value. For a fully connected layer, SN directly calculates the maximum singular value of the second-order matrix. For a convolution layer, the maximum singular value of the parameter matrix is calculated by the iterative method.
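The iterative estimation of the maximum singular value mentioned above is commonly done by power iteration. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the number of iterations, and the nested-list matrix representation are assumptions made for illustration.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def l2_normalize(v, eps=1e-12):
    """Scale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def spectral_norm(W, n_iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    Wt = transpose(W)
    u = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(len(W))])
    v = l2_normalize(matvec(Wt, u))
    for _ in range(n_iters):
        v = l2_normalize(matvec(Wt, u))  # v <- W^T u / ||W^T u||
        u = l2_normalize(matvec(W, v))   # u <- W v / ||W v||
    # sigma is approximated by u^T W v
    return sum(u_i * wv_i for u_i, wv_i in zip(u, matvec(W, v)))


def spectrally_normalize(W, **kwargs):
    """Divide every entry of W by the estimated largest singular value."""
    sigma = spectral_norm(W, **kwargs)
    return [[w / sigma for w in row] for row in W]
```

After normalization the largest singular value of the matrix is approximately 1, so a linear layer using it is approximately 1-Lipschitz with respect to the Euclidean norm. Deep learning frameworks usually run only one or a few power-iteration steps per training update and reuse the vector u between updates; the large iteration count here is only for a standalone estimate.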

C. Class Consistency Loss
CycleGAN [48] proposed a consistency loss to guide the mapping between the source domain and the target domain. For the HSI, the input pair samples $(x_{spe}, x_{spa})$ belong to the same class, though they have different forms. Therefore, the outputs of the two branches should ideally coincide. However, spectral and spatial features have their own pros and cons, which may lead to different classification results. To combine the advantages of both features in different situations, we propose a result-driven loss function that assigns different attention weights to the two features, named class consistency loss, as shown in Fig. 2. The class consistency loss is the distance between the results of the two branches. The root-mean-square error (RMSE) is used to measure this distance. The class consistency loss is defined as follows:
$$L_{cc} = \mathrm{RMSE}\big(F_{spe}(x_{spe}),\, F_{spa}(x_{spa})\big)$$
where $F_{spe}$ and $F_{spa}$ represent the models of the spectral branch and the spatial branch, respectively. The class consistency loss is added as a constraint during training, and the parameters are updated from the calculated loss value by stochastic gradient descent. If the two prediction results are the same (whether right or wrong), the class consistency loss is close to 0 and only fine-tunes the network. If the two prediction results differ, one of them could only be selected at random as the final result without further guidance, which could degrade the final prediction. In this case, the consistency loss guides the adjustment of the parameters. In the classification stage, the role of the class consistency loss is to use the networks to achieve adaptive voting on the two prediction results obtained from the spectral and spatial classifiers. For the discriminator, which is equivalent to a multiclass classifier, the class consistency loss plays the same role. In the generation stage, the class consistency loss also restricts the samples generated by the two generators to the same class, because the spectral and spatial features fed to the generators come from the same class.
For feature extraction, spectral-spatial features with bias and different contributions of pixels in one patch are learned. We aim to learn the spectral-spatial attention weights and distinguish contributions from different pixels, thereby improving the classification accuracy of boundary samples.
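As an illustration of the class consistency loss described above, the sketch below computes the root-mean-square error between the class-probability vectors produced by the two branches. It is a hedged example rather than the authors' code: averaging the squared differences over the classes is an assumption (the exact normalization is not recoverable from the text), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_consistency_loss(p_spe, p_spa):
    """RMSE between the spectral-branch and spatial-branch predictions.

    Assumption: the squared differences are averaged over the classes.
    """
    if len(p_spe) != len(p_spa):
        raise ValueError("both branches must predict the same number of classes")
    mse = sum((a - b) ** 2 for a, b in zip(p_spe, p_spa)) / len(p_spe)
    return math.sqrt(mse)


# Two branches that agree give a loss of zero; full disagreement gives a large loss.
agree = class_consistency_loss(softmax([2.0, 0.1, 0.1]), softmax([2.0, 0.1, 0.1]))
disagree = class_consistency_loss([1.0, 0.0], [0.0, 1.0])
```

The loss vanishes whenever the two branches output the same distribution, whether or not that prediction is correct, which matches the behavior described in the text.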

D. Spectral Pixel and Spatial Patch Branches
Based on the adversarial representation module and class consistency loss, the spectral and spatial branches are designed to exploit spectral and spatial information. The input of the spectral branch is $L_{spe}$ and $U_{spe}$, and the input of the spatial branch is $L_{spa}$ and $U_{spa}$. The encoder maps samples $x_{spe}$ ($x_{spa}$) to features $f_{spe}$ ($f_{spa}$). Instead of random noise, the features extracted by the spectral (spatial) encoder are used as the input of the spectral (spatial) generator. The fake samples are denoted as $\tilde{x}_{spe} \sim G_{spe}(f_{spe})$ and $\tilde{x}_{spa} \sim G_{spa}(f_{spa})$. Then, the following three parts are input to the spectral discriminator: $(x_{spe}, y_{spe}) \sim L_{spe}$, $x_{spe} \sim U_{spe}$, and $\tilde{x}_{spe} \sim G_{spe}(f_{spe})$. Similarly, the following parts are input to the spatial discriminator: $(x_{spa}, y_{spa}) \sim L_{spa}$, $x_{spa} \sim U_{spa}$, and $\tilde{x}_{spa} \sim G_{spa}(f_{spa})$. After training, the test samples flow through the trained encoders and classifiers. The configuration of the spectral branch is shown in Fig. 3. The branch is built mainly from 1-D convolution and 1-D transposed convolution layers. In Fig. 3, k denotes the convolutional kernel size and s the stride, p and Op stand for padding, O represents the number of kernels, and SN represents spectral normalization. The spatial branch replaces the 1-D modules with 2-D modules.
The proposed model employs a semisupervised method. To utilize the unlabeled samples, we add conditional entropy to the objective function. The specific label of a real unlabeled sample is unknown, but it should belong to a certain class; therefore, conditional entropy is added as a prior condition to enhance the performance of the classifier. The equation for conditional entropy is
$$L_{ce} = -\lambda \sum_{c=1}^{C} p(c \mid x_{spe}; \theta) \log p(c \mid x_{spe}; \theta)$$
where $\lambda$ represents the hyperparameter. Given real labeled sample pairs, the purpose of the spectral (spatial) discriminator is to classify them correctly. For real unlabeled samples, the purpose of the discriminator is to assign the proper classes. It will classify the real samples into $C$ classes. The posterior probability is represented as follows:
$$p(c \mid x_{spe}; \theta) = \frac{\exp(f_c)}{\sum_{k=1}^{C} \exp(f_k)}$$
where $\theta$ represents the parameters of the discriminator, $f = D_{\theta}(x_{spe})$ represents the output feature of the discriminator, and $f_c$ is its $c$th component. Essentially, the discriminator is a modified classifier. The role of the discriminator also includes discriminating the authenticity of samples. The probability that a sample belongs to the real set is
$$p(\mathrm{real} \mid x_{spe}; \theta) = \frac{\sum_{k=1}^{C} \exp(f_k)}{\sum_{k=1}^{C} \exp(f_k) + 1}$$
We can treat the generated samples as class $C + 1$, which is represented by the 1 in the denominator. Then, the probability that a sample belongs to the fake set is
$$p(\mathrm{fake} \mid x_{spe}; \theta) = \frac{1}{\sum_{k=1}^{C} \exp(f_k) + 1}$$
The objective function for the discriminator and the classifier in the branch consists of the class cross entropy of labeled samples and the conditional entropy of unlabeled samples. It can be expected that the performance of the network may be worse when the network parameters are initially updated with unlabeled samples. Formula (6) can be rewritten as an expectation over the unlabeled samples when optimizing, where $x_{spe}$ represents the random variable following the distribution of $U_{spe}$, and $\mathbb{E}$ represents the expectation. Then, the negative gradient is calculated using formula (9). During the updating of network parameters, the predicted result $p(c \mid x_{spe}; \theta)$ is strengthened. The neurons related to class $\hat{c} = \arg\max_c p(c \mid x_{spe}; \theta)$ are stimulated to update parameters. The objective functions for the classifier, generator, and discriminator in the spectral pixel branch are given in formulas (10), (11), and (12), respectively. In formula (10), the first two terms represent the cross entropy for classifying labeled samples and the conditional entropy for unlabeled samples. In formula (11), the objective function of the generator is responsible for making the discriminator misjudge. For the objective function of the discriminator in formula (12), the first term aims to classify the labeled samples correctly. The second term assigns the dominant class of unlabeled samples using conditional entropy. The third and fourth terms judge whether the samples are real or fake. The final terms of the three formulas represent the class consistency loss between the spatial and spectral branches. Here, $\beta$ represents the adversarial weight. If the basic classification performance of the classifier is good and it predicts correctly, the training process will develop in the right direction. However, the number of labeled training samples is limited, and the model may not be sufficiently trained. Therefore, it is possible that the initial performance of the classifier is poor, leading to many incorrect predictions, and the error is magnified through the update process of formula (9) and conditional entropy. To address this issue, $\lambda$ is initially set to a small value and gradually increases as the training progresses.
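The relation between the class posterior, the real/fake probabilities, and the conditional entropy can be made concrete with a small numerical sketch. It assumes the common (C + 1)-class discriminator formulation in which the logit of the fake class is fixed to zero, which is what produces the constant 1 in the denominator. The function names and the small `eps` added inside the logarithm are illustrative assumptions, not part of the original method.

```python
import math


def discriminator_probs(f):
    """Turn C discriminator outputs into class and real/fake probabilities.

    f: list of C real-valued outputs. The fake class (C + 1) is assumed to
    have a logit fixed to 0, i.e. exp(0) = 1 in the denominator.
    """
    m = max(max(f), 0.0)                 # shift for numerical stability
    exps = [math.exp(z - m) for z in f]
    fake_term = math.exp(0.0 - m)        # the constant "1", rescaled by the shift
    z_real = sum(exps)
    posterior = [e / z_real for e in exps]      # p(c | x), c = 1..C
    p_real = z_real / (z_real + fake_term)      # sample belongs to the real set
    p_fake = fake_term / (z_real + fake_term)   # sample belongs to the fake set
    return posterior, p_real, p_fake


def conditional_entropy(posterior, lam=1.0, eps=1e-12):
    """Weighted entropy of the predicted class distribution of one sample."""
    return -lam * sum(p * math.log(p + eps) for p in posterior)


posterior, p_real, p_fake = discriminator_probs([0.0, 0.0, 0.0])
uniform_entropy = conditional_entropy(posterior)
confident_entropy = conditional_entropy([1.0, 0.0, 0.0])
```

A uniform prediction yields the maximum entropy (log C), while a confident prediction yields an entropy near zero, so minimizing this term pushes unlabeled samples toward a single dominant class. This is also why a poorly initialized classifier can reinforce its own mistakes, which motivates the gradual increase of the weight described in the text.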
The spatial and spectral branches propagate forward at the same time and backpropagate after calculating their respective loss values with the objective function. Within each branch, E, G, D, and C are optimized alternately. Considering E and G as a whole for updating parameters, we optimize the network parameters of D when the network parameters of E, G, and C are fixed. Conversely, when D is fixed, the remaining parameters are updated. We use the Adam optimizer. Through this end-to-end network structure and alternating optimization, classifier C is finally used to classify test samples.

III. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the datasets, configuration, and hyperparameters are introduced first. Second, we analyze the influence of the proposed dual-branch structure, adversarial learning, and class consistency loss and analyze the sensitivity of the model to the number of labeled samples. Third, we compare the performance of the proposed model with that of competitive methods. Finally, we discuss the comparison results.
1) Indian Pines: The IP dataset was gathered by the AVIRIS sensor in the northwest of Indiana. The IP scene consists of two-thirds agriculture and one-third forest or other natural perennial vegetation.
Table I shows the number of samples in each class of the four datasets. For IP, we randomly select 5% of the labeled samples as the labeled training set. For PU, SA, and WHU, the proportions of selected labeled samples are 3%, 1%, and 0.3% of the whole samples, respectively. The random selection follows the class balance. All input data are normalized to between −1 and 1 in advance.
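The sampling protocol above can be sketched as a per-class random selection followed by min-max scaling. This is an illustrative reading, not the authors' code: interpreting "follows the class balance" as drawing the same fraction from every class, keeping at least one sample per class, and scaling with the global minimum and maximum are all assumptions, as are the function names.

```python
import math
import random
from collections import defaultdict


def stratified_labeled_split(labels, fraction, seed=0, min_per_class=1):
    """Randomly pick `fraction` of the indices of each class as labeled samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    labeled = []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        # Rounding guards against floating-point noise before taking the ceiling.
        k = max(min_per_class, math.ceil(round(fraction * len(idxs), 9)))
        labeled.extend(rng.sample(idxs, min(k, len(idxs))))
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


def scale_to_unit_range(values):
    """Linearly map values to the interval [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Toy example: 100 pixels of class 0 and 20 pixels of class 1, 5% labeled.
toy_labels = [0] * 100 + [1] * 20
labeled_idx, unlabeled_idx = stratified_labeled_split(toy_labels, 0.05)
```

The remaining indices form the unlabeled pool used by the semisupervised terms. The `min_per_class` guard matters for very small classes, where a low fraction would otherwise select no labeled sample at all.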
In particular, we define boundary samples as pixels whose class differs from that of at least one of the eight surrounding pixels, as shown in Fig. 5.
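The boundary-sample definition can be expressed as a scan over the 8-neighborhood of each pixel in the label map. The sketch below is a minimal plain-Python version under two assumptions that the text does not specify: neighbors outside the image are ignored, and unlabeled background pixels (if any) are treated like any other label value.

```python
def boundary_mask(label_map):
    """Mark pixels whose label differs from at least one of their 8 neighbors.

    label_map: 2-D list (rows of class labels). Returns a 2-D list of booleans.
    """
    h, w = len(label_map), len(label_map[0])
    mask = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue  # skip the pixel itself
                    ni, nj = i + di, j + dj
                    # Assumption: neighbors outside the image are ignored.
                    if 0 <= ni < h and 0 <= nj < w:
                        if label_map[ni][nj] != label_map[i][j]:
                            mask[i][j] = True
    return mask


# Toy label map: a vertical boundary between class 1 and class 2.
toy_map = [
    [1, 1, 1, 2],
    [1, 1, 1, 2],
    [1, 1, 1, 2],
]
toy_mask = boundary_mask(toy_map)
```

In the toy map, only the two columns adjacent to the class change are marked, which corresponds to the pixels whose patches are most likely to be heterogeneous.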

B. Experimental Setting
All experiments are conducted on a computer equipped with an NVIDIA GeForce GTX 1080Ti with 12-GB RAM. The software environment is Ubuntu 14.04 ultimate. The samples are processed by a Gaussian smoothing kernel before being input to the model. In the spatial branch, the spectral dimension of HSI patches is reduced to 10 by PCA, and the size of patches is set to 8 × 8 for IP, PU, and SA, and 9 × 9 for WHU.
TABLE I. LAND-COVER CLASS INFORMATION AND THE NUMBER OF ANNOTATED SAMPLES OF IP, PU, SA, AND WHU
During training, we use a batch size of 32. The learning rate is set by an annealing algorithm within the range [0.0, 0.002]. The conditional entropy weight λ determines the effect of unlabeled samples. We consider λ values in the range [0.5, 1]; λ increases by 0.05 every 100 training steps. The adversarial weight β, with the value range [0.5, 1], also increases by 0.05 every 100 training steps. The number of training steps is 1000. The network parameters are presented in Section II. The above parameters are adjusted using a standard random grid search cross-validation framework. F1-score, overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa) are used to quantitatively evaluate models. The results are obtained after ten independent runs, with the training and test sets randomly divided each time. The FLOPs and parameters of SSARL (input 8 × 8 × 10) are shown in Table II, where D represents the discriminator and EGC represents the jointly updated encoder, generator, and classifier.
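The step-wise increase of λ and β described above amounts to a simple staircase schedule. A minimal sketch follows, using the values stated in the text as defaults; the function name and the clipping at the upper bound are assumptions.

```python
def ramp_weight(step, start=0.5, end=1.0, increment=0.05, every=100):
    """Staircase schedule: add `increment` every `every` steps, capped at `end`."""
    value = start + increment * (step // every)
    # Rounding removes floating-point noise such as 0.6000000000000001.
    return min(end, round(value, 10))


# Weight used at a few training steps of a 1000-step run.
schedule = [ramp_weight(s) for s in (0, 99, 100, 500, 999, 1000)]
```

With these defaults the weight starts at 0.5, reaches 0.95 during the last 100 steps of a 1000-step run, and would be capped at 1.0 from step 1000 onward. The same function can serve for both λ and β since the text gives them identical ranges and increments.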

C. Ablation Study
To demonstrate the effectiveness of the semisupervised strategy, the spectral adversarial learning branch, and the class consistency loss, we compare CNN, CNN-CE, CNN-CE-SS, and the proposed method. CNN represents the method with the same structure and configuration as the encoder and the classifier in the proposed spatial branch but uses a standard cross-entropy loss function. Therefore, CNN does not utilize unlabeled samples. CNN-CE represents a semisupervised model that adds conditional entropy for training. CNN-CE-SS represents a model that adds spectral and spatial branches based on CNN-CE without class consistency loss, and its structure and configuration are the same as those of the proposed method.
The results of OA, AA, and Kappa are shown in Table III. The overall classification results presented in the table (from left to right) improve as components are added. Compared with CNN, the OA improves by 0.5%, 1.0%, 0.8%, and 0.6% after utilizing unlabeled samples, which indicates that useful information is mined from the large number of unlabeled samples. Compared with CNN-CE, CNN-CE-SS shows a significant improvement of OA in IP and WHU, a slight improvement in PU, but a decline in SA due to the introduction of spectral features. It can be inferred that the adversarial representation learning modules mitigate the sample imbalance in IP by generating fake samples. The decline suggests the possible shortcomings of spatial features when no combination method is applied. Compared with CNN-CE-SS, the proposed method improves OA by 0.34%, 0.01%, 0.21%, and 0.50%. The OA of SA and PU does not vary significantly, but the proposed method performs better on AA and Kappa. This indicates that the class consistency loss inhibits the influence of heterogeneous patches and extracts valuable joint features by adding the spectral constraint adaptively. Although SSARL achieves better performance overall, Table III presents two outliers. First, CNN achieves better performance on AA in IP. Second, CNN-CE-SS performs worse than CNN-CE in SA. IP and SA are obtained from the same series of hyperspectral sensors. It can be inferred that caution is needed in the use of spatial-spectral features in SA and of the imbalanced unlabeled samples in IP.
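For reference, OA, AA, and Kappa as reported in the tables can all be derived from a confusion matrix. The sketch below uses the standard definitions (OA as the fraction of correctly classified samples, AA as the mean per-class recall, and Cohen's kappa); the function name and the convention that rows are true classes are assumptions for illustration.

```python
def classification_metrics(confusion):
    """Compute (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j]: number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total

    # AA: mean of per-class accuracies, skipping classes without test samples.
    recalls = [confusion[i][i] / sum(confusion[i])
               for i in range(n) if sum(confusion[i]) > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(sum(confusion[k]) * col_sums[k] for k in range(n)) / (total * total)
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa


# Toy two-class example.
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which is why the two metrics can move in different directions on imbalanced datasets such as IP. The sketch does not handle the degenerate case in which the chance agreement equals 1.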

D. Sensitivity Analysis of the Number of Labeled Samples
The number of labeled training samples greatly affects the classification performance of deep learning methods. Therefore, we analyze the performance of the proposed method and other methods using different numbers of labeled samples. The accuracy on three datasets decreases as the number of labeled samples decreases. First, the curve of SSARL lies above the other curves, which shows that our method has the highest accuracy. Second, the curve of SSARL is the most stable, which demonstrates that it is the least sensitive to the number of labeled samples. In SA and PU, the proposed method is almost unaffected by the number of labeled samples. Therefore, our method is a better choice when the number of labeled samples for training is limited. It is worth noting that MSGAN also performs well, which may be attributed to the ability of GAN-based methods to expand the sample space effectively with a small number of labeled samples.
1) Indian Pines: The classification results of the IP dataset are shown in Table IV. This table records the average classification accuracy and standard deviation over ten independent runs. The last 16 rows record the classification F1-score of the corresponding class. Compared with RBF-SVM, 1-D CNN extracts deep spectral features from pixel samples. Considering the spatial feature and image patch samples, HybridSN and 3-D CNN utilize the spectral-spatial feature. The OA of 3-D CNN shows a 29% improvement compared to 1-D CNN. Furthermore, compared with 3-D CNN, which uses 3-D convolutional layers to extract spatial-spectral features from patch samples, RSEN and DBR utilize pixel samples, patch samples, and unlabeled samples to obtain information, resulting in a 3.4% improvement in the OA of RSEN compared to HybridSN. SSFTT extracts global features, which results in an improvement of 5.7% OA compared to RSEN. Classifiers usually perform worse in IP when the number of labeled samples is limited. Through the adversarial process, the encoder extracts robust spatial and spectral features. The dual-branch structure and class consistency loss ensure the performance on heterogeneous samples. Compared with SSFTT, SSARL improves OA, AA, and Kappa by 1.2%, 0.2%, and 1.5%, respectively. The proposed method also achieves the best classification results in 14 classes, especially in classes 1, 9, 13, and 16, which have a small sample size. Fig. 9 shows the classification maps of different methods in the IP dataset. First, maps produced by deep learning methods using spectral-spatial features contain fewer isolated noisy points. However, due to complex spatial information at the boundary, a large number of misclassified points appear there. The map produced by SSARL is closest to the ground truth map. This demonstrates that the proposed method can combine spectral and spatial features adaptively, thereby classifying test samples more accurately.
2) Pavia University: The classification results of the PU dataset are shown in Table V. The distribution of samples in PU is more scattered than in IP, making it easier to classify. First, SSARL performs better than the other seven models on OA, AA, and Kappa. The performance of RBF-SVM, 1-D CNN, 3-D CNN, HybridSN, RSEN, and DBR improves progressively due to spectral-spatial features and unlabeled samples. The 1.9% OA improvement indicates that using unlabeled samples for semisupervised training can effectively improve the classification performance. Compared with 3-D CNN, the proposed method improves OA, AA, and Kappa by 4.7%, 8.5%, and 6.0%, respectively. Compared with SSFTT, SSARL improves OA, AA, and Kappa by 1.1%, 1.5%, and 0.3%, respectively. The proposed method achieves the best classification results in eight of the nine classes, with six classes classified entirely correctly. Performance in the eighth and fifth classes has improved. Fig. 10 shows the classification maps of different methods in the PU dataset. The maps are consistent with the conclusions drawn from the IP dataset. SSARL achieves better regional consistency and boundary performance.
3) Salinas: The classification results on the SA dataset are presented in Table VI. SA is easier to classify than IP, so all the methods achieve higher accuracy than on the IP dataset. Under the three evaluation criteria OA, AA, and Kappa, SSARL performs better than the other seven methods. However, RSEN does not perform well, which may be because a large number of unlabeled samples negatively affects classification. For this reason, our method gradually increases the loss weight of unlabeled samples during training. Compared with the competing method SSFTT, SSARL improves OA, AA, and Kappa by 0.8%, 0.78%, and 0.99%, respectively. SSARL achieves the best classification accuracy on all the classes, reaching 100% accuracy in 13 of them. For class 16, the performance is significantly improved. Fig. 11 shows the classification maps of the different methods on the SA dataset. SSARL distinguishes samples of the 8th and 15th classes more effectively, which shows that it classifies boundary samples better and reduces the number of misclassified points within each class.
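The gradual increase of the unlabeled-sample loss weight can be illustrated with a simple ramp-up schedule. The sketch below assumes a linear ramp; the schedule shape, the names `unlabeled_weight` and `total_loss`, and the parameters `ramp_epochs` and `max_weight` are hypothetical, since the exact schedule is not specified here.

```python
def unlabeled_weight(epoch, ramp_epochs=50, max_weight=1.0):
    """Weight of the unlabeled-sample loss at a given epoch.

    Assumed schedule: linear growth from 0 to max_weight over the first
    ramp_epochs epochs, constant afterwards.
    """
    if ramp_epochs <= 0:
        return max_weight
    frac = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return max_weight * frac

def total_loss(supervised_loss, unsupervised_loss, epoch,
               ramp_epochs=50, max_weight=1.0):
    """Supervised loss plus the ramped unlabeled-sample loss."""
    w = unlabeled_weight(epoch, ramp_epochs, max_weight)
    return supervised_loss + w * unsupervised_loss

# Early in training the unlabeled term contributes little; later it counts fully.
early = total_loss(1.0, 2.0, epoch=5)    # weight 0.1
late = total_loss(1.0, 2.0, epoch=80)    # weight 1.0
```

Starting with a small weight lets the labeled samples shape the classifier before less reliable predictions on unlabeled samples begin to influence training.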
Furthermore, we compared SSARL with several recent GAN-based models. The comparison results are presented in Table VII. HSGAN uses spectral samples, whereas 3-D GAN and ARL-GAN train their classification models on spatial image patches. MSGAN is a spectral-spatial method. Compared with HSGAN, SSARL increases OA by 24.48%, 14.69%, and 11.70% on the three datasets. Compared with 3-D GAN, SSARL increases OA by 3.13%, 1.64%, and 1.11% on the three datasets. Compared with MSGAN, SSARL increases OA by 2.65%, 0.69%, and 0.88% on the three datasets. These results demonstrate that the adversarial representation model, the class consistency loss, and the dual-branch structure contribute to better classification accuracy.

4) WHU-Hi-LongKou: The classification results on the WHU dataset are presented in Table VIII. WHU is larger than the above datasets in both image size and data volume. As shown in the table, SSARL outperforms the other competing methods in terms of OA, AA, and Kappa. DBR, which uses 1-D and 2-D pretrained networks, achieves the best performance in four classes. Fig. 12 displays the classification maps of the different methods on the WHU dataset. SSFTT and RSEN produce regular patterns of misclassified points within class regions, and the complex boundaries even lead to the misclassification of background pixels.

F. Classification of Boundary Samples
SSARL focuses on improving the accuracy of boundary samples, for which spatial features extracted from heterogeneous patches are a known weakness, and it adopts a different feature-utilization strategy. Therefore, in this section, we analyze the OA of boundary test samples. The classification maps are shown in Fig. 13, where gray circles point to magnified details, and the quantitative classification performance is reported in Table IX.
The class boundaries of IP and SA are flatter, while those of PU and WHU are more irregular. The classification results on IP and WHU are worse than those on SA because of their complicated and unbalanced samples. Although the class boundaries of PU are complex, its classification results remain good. Compared with the results in Section III-E, the OA of the boundary test samples is lower than that of the entire test sets: it is 7.9%, 3.8%, 2.3%, and 13.4% lower on the four datasets for HybridSN, and 7.3%, 2.1%, 4.1%, and 16.1% lower on the four datasets for SSFTT. This indicates that the spatial information of heterogeneous samples at the boundary is easily influenced by neighboring classes, which reduces the effectiveness of spatial features. In contrast to the other methods, SSARL extracts robust spectral and spatial features from its two branches and combines them through the class consistency loss rather than through fixed concatenation. Therefore, SSARL can adapt to both within-class and class-boundary situations. The percentage of correctly classified boundary samples for SSARL is higher than that of the best competing method by 1.8%, 0.1%, and 1.7% on three of the four datasets. On WHU, DBR achieves the best performance.
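One plausible way to isolate boundary test samples for such an analysis is to mark every labeled pixel whose surrounding window contains a pixel of a different class, and then compute OA on those pixels only. The sketch below illustrates this idea under stated assumptions: the neighborhood-based definition, the `patch_size` and `background` parameters, and the function names are hypothetical, because the selection rule for boundary samples is not given in this section.

```python
import numpy as np

def boundary_mask(gt, patch_size=3, background=0):
    """Mark labeled pixels whose patch_size x patch_size window contains
    a labeled pixel of a different class (assumed boundary definition)."""
    r = patch_size // 2
    h, w = gt.shape
    padded = np.pad(gt, r, mode="edge")  # replicate borders, no new classes
    mask = np.zeros((h, w), dtype=bool)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = padded[r + dy:r + dy + h, r + dx:r + dx + w]
            # A neighbor counts only if it is labeled and of another class.
            mask |= (shifted != gt) & (shifted != background)
    return mask & (gt != background)

def boundary_oa(gt, pred, mask):
    """Overall accuracy restricted to the pixels selected by mask."""
    return float(np.mean(pred[mask] == gt[mask]))

# Toy ground truth: two classes separated by a vertical boundary.
gt = np.array([[1, 1, 2, 2],
               [1, 1, 2, 2],
               [1, 1, 2, 2]])
pred = gt.copy()
pred[0, 1] = 2  # one misclassified boundary pixel
mask = boundary_mask(gt, patch_size=3)
acc = boundary_oa(gt, pred, mask)
```

In practice the mask would additionally be intersected with the set of test pixels, and `patch_size` would typically match the input patch size of the classifier so that the selected pixels are exactly those with heterogeneous patches.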

IV. CONCLUSION
In this article, we proposed a dual-branch SSARL method for HSI classification based on a generative adversarial network. The method focuses on training with limited labeled samples and on the utilization of spectral-spatial features. In particular, we considered the relationship between pixel samples and complex heterogeneous image patch samples. We improved the ability to extract features from labeled and unlabeled samples by adding an adversarial process. The two branches were responsible for generating pixels and image patches, respectively, and for extracting their features. The class consistency loss was proposed to combine the two branches. The experiments demonstrated the effectiveness of the dual-branch structure and the class consistency loss. Compared with competing methods, SSARL performed better on four datasets. Moreover, the proposed SSARL aims at improving the classification performance on boundary samples, an issue that is often overlooked but has a negative impact on the overall classification results. We believe that the proposed method has two limitations. First, the training process is unstable: although the method performs well, the constraints introduced by the loss functions increase the difficulty of training. Second, the structure of the encoder is relatively simple, whereas some scenarios may require a stronger feature representation capability.