Two-Branch Generative Adversarial Network With Multiscale Connections for Hyperspectral Image Classification

Hyperspectral image (HSI) classification has long drawn great attention in the field of remote sensing, and a variety of deep learning models are increasingly being applied to it. Nevertheless, datasets with limited labels and imbalanced classes make the classifier prone to overfitting. To address this problem, this article proposes a two-branch generative adversarial network with multiscale connections (TBGAN), which includes two generators that produce the spectral and spatial samples, respectively. The spectral generator incorporates the self-attention mechanism to capture the long-term dependencies across the spectral bands. Meanwhile, a discriminator with two branches is devised in TBGAN to extract the joint spectral-spatial features. In addition, multiscale connections are placed between the discriminator and the two generators to alleviate the instability caused by the backward propagation of gradients in GAN. Furthermore, a feature-matching term is added to the loss function to prevent the generators from overtraining on the current discriminator, thereby further improving the stability of the network. Experiments on three benchmark datasets demonstrate that TBGAN achieves highly competitive classification accuracy and is less sensitive to the training sample size than several state-of-the-art methods.

manner and extract the hierarchical features simultaneously, forming a unified end-to-end framework. Consequently, deep learning has gradually become a powerful tool for HSI classification in recent years [13]. Typical deep learning models include convolutional neural network (CNN) [14], [24], stacked autoencoder (SAE) [25], recurrent neural network (RNN) [26], and deep belief network (DBN) [27]. Among the abovementioned deep learning-based methods, the inputs of RNN, SAE, and DBN are composed of spectral vectors, without containing the spatial features, thereby largely giving rise to unsatisfactory classifications. Nevertheless, CNN can simultaneously extract the spectral and spatial features of HSI and adopt the strategies of local connections and weight sharing to reduce the number of parameters, drawing great attention in the field of HSI classification. Hu et al. [14] first designed a five-layer CNN with the spectrum of each pixel as input and extracted the spectral features to perform the classification. Besides, a pixel-pair voting strategy enabled a one-dimensional convolutional neural network (1D-CNN) to achieve a promising classification result in the case of limited training samples [15]. However, due to the lack of texture and context information of the samples, 1D-CNN is prone to suffer from misclassification. Therefore, some scholars [16], [21] have introduced spatial features into the network to construct joint spectral-spatial frameworks, which can be roughly divided into two categories. One paradigm is to construct a two-branch structure, in which each branch extracts the spectral or spatial features respectively, and then concatenates these features for the classification [16], [18]. For example, Xu et al. [16] developed a spectral-spatial unified network (SSUN), employing a long short-term memory model (LSTM) and multiscale convolutional neural network to respectively extract the spectral and spatial features. 
The other paradigm receives 3D cubes containing both spectral and spatial information and extracts the joint features with one or more convolutional operators [19], [21]. For instance, the multiscale 3D deep convolutional neural network (M3D-DCNN) [20] utilized 3D convolutional operators to extract the multiscale spatial and spectral features, reporting impressive results. In addition, some studies [22], [24] combine CNN with self-supervised learning, exploiting a large amount of unlabeled data and achieving promising classification results.
Although CNN-based methods have achieved excellent classification results, they are prone to overfitting when a large number of learnable parameters must be tuned with limited training data [28], [29]. Moreover, gathering labeled data is expensive and time-consuming in the field of remote sensing, and the obtained data generally follow a long-tailed distribution, which hinders the application of CNN.
The generative adversarial network (GAN) [30] was put forward to generate high-quality images through the adversarial training process between its generator and discriminator. With the advancement of GAN, hundreds of variants have been derived. Among them, the relatively popular models are the conditional generative adversarial network (CGAN) [31], the deep convolutional generative adversarial network (DCGAN) [32], and Wasserstein GAN [33]. To alleviate the above overfitting problem of CNN-based methods, some scholars [34], [44] introduced GAN into HSI classification, which yields encouraging classification results when only a small number of samples is available. Zhan et al. [34] proposed a semi-supervised classification method based on 1D-GAN, which was the first application of GAN to HSI classification. A DCGAN-based method was proposed in [35], in which the discriminator took as input the first three principal components obtained by applying principal component analysis (PCA) to the original image, and commendable classification results were obtained. Zhan et al. [36] further refined the classification via a dynamic-neighborhood voting mechanism applied after a first classification that used the spectral features only. A multiclass spatial-spectral GAN (MSGAN) [37] was developed with two generators that produce the fake spectral and spatial samples, respectively, together with novel adversarial objectives for the multiclass setting, and achieved strong results. To exploit the rich information in unlabeled samples, the generator network in multitask GAN (MTGAN) [38] was designed to undertake the reconstruction and classification tasks simultaneously. To improve the generalization performance, a self-attention-based GAN [39] was combined with the variational auto-encoder (VAE) [45], in which the generator received both encoder-generated and random latent vectors to produce more enhanced virtual samples.
Although the above GAN-based models have achieved satisfactory HSI classification performance, their training quality hinges on the gradients transmitted from the discriminator to the generator. Hence, the gradients may vanish through accumulation when the GAN has too many layers. Furthermore, Arjovsky and Bottou [46] pointed out that when there is little overlap between the distributions of the real and the generated data, the discriminator passes uninformative gradients to the generator. These problems are the major contributors to the training instability of GAN, which limits its classification accuracy. To improve the training stability, a multiscale gradients GAN (MSG-GAN) [47] was developed for synthesizing high-resolution faces. It connects the intermediate layers of the generator with those of the discriminator, so that multiscale gradients can be passed directly from the discriminator to the generator. To solve the training instability problem of GAN in HSI classification, this article, inspired by MSG-GAN, establishes multiscale connections between the discriminator and the generators. The main contributions of this article are summarized as follows.
1) We propose a two-branch generative adversarial network with multiscale connections (TBGAN) for HSI classification. The generators in TBGAN produce virtual spectrums and spatial patches to alleviate the small-sample-size problem.
2) To improve the training stability, multiscale connections are established between the discriminator and the two generators. Moreover, a feature-matching term is added to the loss function to further increase the stability.
3) The discriminator with two branches is designed in TBGAN to extract the joint spectral-spatial features. The trained discriminator can be employed as a classifier.

A. BASIC FRAMEWORK OF GAN
Before formally introducing the TBGAN method, we first review the basics of GAN. Motivated by two-person zero-sum game theory, the GAN model [30] was proposed as a new framework that optimizes deep learning models through an adversarial training process. It consists of a generator $G$ and a discriminator $D$, as exhibited in Fig. 1. $G$ tries to capture the underlying distribution of the real data and output fake data, while $D$ undertakes a binary classification task, judging whether the input sample is real or not. Specifically, $G$ takes a random noise $z$ as input and attempts to generate the fake data $G(z)$. $D$ takes the real data $x$ or the fake data $G(z)$ as input and outputs the probability that the input comes from the real data. The objective function of GAN is defined as follows:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

where $\mathbb{E}$ denotes the expectation operator, and $p_{\mathrm{data}}(x)$ and $p_{z}(z)$ indicate the distributions of the real data and the noise, respectively. In the optimization procedure, $D$ seeks to distinguish real from fake data as precisely as possible, that is, to maximize $V(D, G)$, while $G$ has the opposite objective: it attempts to fool $D$ by generating real-like data so as to minimize $V(D, G)$. The parameter updating of $G$ relies on the gradients back-propagated through $D$.
As one module is updated, the other is fixed, and they evolve alternately until their capabilities reach an equilibrium.
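As a small numerical illustration of the objective reviewed above, the sketch below estimates the standard GAN value function of [30], that is, the batch mean of $\log D(x)$ over real samples plus the batch mean of $\log(1 - D(G(z)))$ over generated samples. The function name `value_function` and the example probabilities are hypothetical and chosen only for illustration.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples, each in (0, 1)
    d_fake: D(G(z)) for a batch of generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
v_undecided = value_function([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes V towards 0 from below.
v_confident = value_function([0.99, 0.98], [0.01, 0.02])
```

The discriminator update tries to raise this quantity, whereas the generator update tries to lower it, which is the alternating scheme described in the text.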

B. PROPOSED METHOD
Inspired by the adversarial training mechanism of GAN, this article proposes a TBGAN framework for the classification of ground objects by extracting the joint spectral-spatial features. Similar to the traditional GAN, TBGAN consists of generator and discriminator components. As can be seen from Fig. 2, TBGAN has two branches and is composed of three modules: the spectral generator $G_{\mathrm{spec}}$, the spatial generator $G_{\mathrm{spat}}$, and the discriminator $D$. To generate the corresponding virtual samples, the two generators receive both noise and labels as input and learn the spectral and spatial data distributions of the real images, respectively. $D$ takes both real and virtual samples as inputs, extracts the joint spectral-spatial features, and eventually performs the classification task. Here, the real spatial samples are the cubes cropped around each pixel from the first three principal components obtained by the PCA transformation. Keeping only the first few dominant components reduces the spectral dimension. It is worth noting that the intermediate layers of $D$ are connected both with their counterparts in the two generators and with the multiscale spectral/spatial features of the real samples obtained by downsampling. This kind of skip connection allows $D$ to consider the multiscale features of both the real and the fake samples, thus enhancing its discriminative ability. Besides, such multiscale connections allow the gradients to be passed directly from $D$ to the intermediate layers of the two generators, which effectively avoids the training instability caused by gradient accumulation in previous GAN models.

1) SPECTRAL AND SPATIAL GENERATORS OF TBGAN
The generators $G_{\mathrm{spec}}$ and $G_{\mathrm{spat}}$ are employed to generate virtual samples containing spectral and spatial information, respectively. As shown in Fig. 3, the inputs of the two generators are $(z_{\mathrm{spec}}, y)$ and $(z_{\mathrm{spat}}, y)$, where $z_{\mathrm{spec}}$ and $z_{\mathrm{spat}}$ represent the noise vectors and $y$ denotes the one-hot coded labels. By concatenating labels and noise as input, the generators can learn the class-specific features during training, thus reducing the possibility of mode collapse [31]. The virtual spectrums generated by $G_{\mathrm{spec}}$ are denoted by $G_{\mathrm{spec}}(z_{\mathrm{spec}}, y)$, while the virtual spatial patches generated by $G_{\mathrm{spat}}$ are denoted by $G_{\mathrm{spat}}(z_{\mathrm{spat}}, y)$. In addition, each virtual sample is assigned to class $n + 1$ ($n$ is the number of dataset classes) and endowed with an artificial label $y_{\mathrm{fake}} = \frac{1}{n}(1, 1, \ldots, 1)^{T}_{n}$. $G_{\mathrm{spec}}$ contains five 1-D transposed convolutional layers (1D-TConv) with a kernel size of 5. $G_{\mathrm{spat}}$ stacks four 2D-TConv layers, each with a kernel size of $3 \times 3$. Except for the last TConv layer, which takes tanh as the activation function, each TConv layer in both generators uses rectified linear units (ReLUs) as the nonlinear activation function and adopts batch normalization.
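The construction of the generator inputs and of the artificial label can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, the noise is assumed to be Gaussian, the noise is assumed to precede the label in the concatenation, and the class count of 9 is only an example value.

```python
import random

def one_hot(label, n_classes):
    """One-hot code a class index in the range 0 .. n_classes - 1."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def generator_input(label, n_classes, noise_dim=100, rng=None):
    """Concatenate a noise vector with the one-hot label.

    Assumptions: Gaussian noise, and noise placed before the label.
    """
    rng = rng or random.Random()
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, n_classes)

def fake_label(n_classes):
    """Artificial label y_fake = (1/n) * (1, ..., 1) with n entries."""
    return [1.0 / n_classes] * n_classes

# Example with a hypothetical 9-class dataset and a 100-dimensional noise vector.
x = generator_input(label=2, n_classes=9, rng=random.Random(0))
y_fake = fake_label(9)
```

Because the label is part of the input, the same noise vector can be steered towards different classes simply by changing the one-hot part.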
However, existing models still struggle to capture long-term dependencies across the spectral bands because HSI contains a large number of bands [48]. Recently, the self-attention mechanism [49] has emerged as a promising way to address this issue, obtaining global information of the feature maps through simple query and assignment operations [50]. Therefore, self-attention is introduced into $G_{\mathrm{spec}}$ to calculate the response of all bands in the spectral sequence to a given band. The self-attention layer is placed at the end of $G_{\mathrm{spec}}$ because the feature maps reach their largest size after the five 1D-TConv operations, which allows the self-attention mechanism to perform well.
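The idea of letting every band position respond to all other positions can be illustrated with a bare dot-product self-attention over a 1-D feature sequence. The sketch below is a simplification made for illustration: it uses the input itself as query, key, and value, and omits the learned projections and any residual scaling that an actual self-attention layer would contain.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Dot-product self-attention over a sequence.

    features: list of positions (bands), each a list of channel values.
    Query, key, and value are all the raw input (assumption: no learned
    projection matrices).
    """
    out = []
    for query in features:
        # Response of every position in the sequence to the current one.
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum of all positions, channel by channel.
        out.append([
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(len(query))
        ])
    return out

bands = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(bands)
```

Each output position is a convex combination of all input positions, so distant bands can influence one another in a single layer, which is the property exploited for long-term spectral dependencies.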

2) DISCRIMINATOR OF TBGAN
In this article, a two-branch discriminator $D$ is designed to classify ground objects by exploiting the joint spectral-spatial features. The architecture of $D$ is depicted in Fig. 4. There are two sources of input samples for $D$: one is the spectrums and spatial patches of the real images, denoted by $(X_{\mathrm{spec}}, X_{\mathrm{spat}})$, and the other is the virtual samples generated by the two generators, denoted by $(G_{\mathrm{spec}}(z_{\mathrm{spec}}, y), G_{\mathrm{spat}}(z_{\mathrm{spat}}, y))$. Each branch of $D$ consists of several Conv-Blocks, which extract the spectral or spatial features of the input samples. Besides, the pooling operation is replaced by strided convolution in all Conv-Blocks so that the downsampling is learned adaptively. Fig. 5 exhibits the structure of the Conv-Block in the spatial branch, which is nearly identical to that in the spectral branch. For the input feature maps, the height and width are both denoted by $2w$, and $c$ is the number of channels. To obtain the spatial features, the Conv-Block first performs strided convolution to halve the size of the feature maps and double the number of channels, and then concatenates the resulting feature maps with the multiscale features. These multiscale features consist of the intermediate layer outputs of the generator and the downsampled versions of the real data. After that, the concatenated feature maps are delivered into a convolution layer whose kernel size is $3 \times 3$ ($5 \times 1$ in the spectral branch). During this convolution, the size and number of feature maps remain unchanged. Finally, the number of channels is halved by a further $1 \times 1$ convolution. After four successive Conv-Blocks, the output spatial features are flattened into a one-dimensional vector. Similarly, the spectral features are flattened into a vector after five successive Conv-Blocks in the spectral branch.
By further concatenating these two vectors, the joint spectral-spatial features are obtained. These joint features are then delivered into a softmax layer to classify the ground objects.
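To make the bookkeeping of the spatial Conv-Block easier to follow, the sketch below traces only the tensor shape (number of channels and spatial size) through the four steps described above. It assumes an even input size and takes the channel count of the concatenated multiscale features as a parameter, since that number depends on the configuration tables; the function name and example values are hypothetical.

```python
def conv_block_shape(channels, size, multiscale_channels):
    """Trace (channels, size) through one spatial Conv-Block.

    channels: number of input channels c
    size: input height and width 2w (assumed even)
    multiscale_channels: channels of the concatenated multiscale features
    """
    assert size % 2 == 0, "this sketch assumes an even spatial size"
    # 1) Strided convolution: halve the size, double the channels.
    channels, size = 2 * channels, size // 2
    # 2) Concatenation with the multiscale features along the channel axis.
    channels = channels + multiscale_channels
    # 3) 3x3 convolution: size and number of feature maps are unchanged.
    # 4) 1x1 convolution: halve the number of channels.
    channels = channels // 2
    return channels, size

# Hypothetical example: 16 channels of 32 x 32, with multiscale features
# that have as many channels as the strided-convolution output (32).
shape = conv_block_shape(16, 32, multiscale_channels=32)
```

Under the stated assumption the block maps 16 channels of size 32 × 32 to 32 channels of size 16 × 16, i.e., the block as a whole halves the resolution and doubles the channels.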
To avoid the training instability caused by gradient accumulation, the intermediate layers of $D$ are connected with their counterparts in the two generators by combining $1 \times 1$ convolution with concatenation. To match the size of the feature maps from the intermediate layers of the generators, the real spectrums and spatial patches are downsampled in an interlaced fashion.
Meanwhile, the number of intermediate layer outputs in $G_{\mathrm{spec}}$ and $G_{\mathrm{spat}}$ is reduced by $1 \times 1$ convolution to correspond to the different downsampled versions of the real samples. Subsequently, the resulting feature maps and the downsampled real data are each expanded by $1 \times 1$ convolution to produce the multiscale features, which have the same number of channels as the intermediate layer outputs in $D$. Finally, the multiscale features are concatenated with their counterparts from the intermediate layers in $D$ and then delivered into the corresponding Conv-Blocks.
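The downsampling of the real samples can be sketched as follows, under the assumption that "interlaced" means keeping every other element along each downsampled axis; the text does not spell out the exact scheme, so this reading and the helper names are illustrative only.

```python
def downsample_spectrum(spectrum, factor=2):
    """Keep every `factor`-th band of a 1-D spectrum (assumed interlacing)."""
    return spectrum[::factor]

def downsample_patch(patch, factor=2):
    """Keep every `factor`-th row and column of a 2-D patch."""
    return [row[::factor] for row in patch[::factor]]

spectrum = [1, 2, 3, 4, 5, 6]
patch = [[r * 4 + c for c in range(4)] for r in range(4)]

half_spectrum = downsample_spectrum(spectrum)   # [1, 3, 5]
half_patch = downsample_patch(patch)            # [[0, 2], [8, 10]]
```

Applying the helpers repeatedly yields a pyramid of real samples whose sizes can be matched against the successive intermediate outputs of the generators.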
Here $D(X_{\mathrm{spec}}, X_{\mathrm{spat}})$ and $D(G_{\mathrm{spec}}(z_{\mathrm{spec}}, y), G_{\mathrm{spat}}(z_{\mathrm{spat}}, y))$ denote the discriminant results of $D$ for the real and virtual samples, respectively. Note that each convolutional layer in $D$ employs leaky rectified linear units (Leaky-ReLUs) as the nonlinear activation function. Meanwhile, each layer applies batch normalization except for the input and the output layers.

3) LOSS FUNCTION OF TBGAN
The discriminator in the classical GAN uses a sigmoid classifier to distinguish whether an input is real or fake, which is a binary classification. For multi-class classification, the discriminator in the ACGAN [51] method is equipped with a softmax classifier. In recent years, to improve the adversarial training effect for multi-class classification, a multiclass adversarial strategy [37] was devised, which enables the softmax layer to simultaneously discriminate the input source and perform the classification task. For this reason, this multiclass adversarial strategy is also introduced into TBGAN. Meanwhile, a feature-matching term is added to the loss function so that the generated samples better follow the distribution of the real data. Consequently, the loss function of TBGAN can be defined as follows:

$$L_{G} = L_{c} + \lambda L_{s}, \qquad L_{D} = L_{\mathrm{real}} + L_{\mathrm{fake}} \tag{2}$$

where $L_{G}$ and $L_{D}$ represent the loss functions corresponding to the generators and the discriminator, respectively. $L_{c}$ denotes the categorical loss of the virtual samples corresponding to the true labels $y$, and $L_{s}$ is the summation of the matching losses of the spectral and spatial features. The hyperparameter $\lambda$ is utilized to trade off $L_{c}$ and $L_{s}$. $L_{\mathrm{real}}$ denotes the categorical loss of the real samples, and $L_{\mathrm{fake}}$ represents the categorical loss of the virtual samples with the artificial label $y_{\mathrm{fake}} = \frac{1}{n}(1, 1, \ldots, 1)^{T}_{n}$. Specifically, these losses can be calculated as follows:

$$\begin{aligned}
L_{c} &= \mathrm{CE}\left(D(G_{\mathrm{spec}}(z_{\mathrm{spec}}, y), G_{\mathrm{spat}}(z_{\mathrm{spat}}, y)),\, y\right) \\
L_{s} &= \left\| f_{1}(X_{\mathrm{spec}}) - f_{1}(G_{\mathrm{spec}}(z_{\mathrm{spec}}, y)) \right\|_{2}^{2} + \left\| f_{2}(X_{\mathrm{spat}}) - f_{2}(G_{\mathrm{spat}}(z_{\mathrm{spat}}, y)) \right\|_{2}^{2} \\
L_{\mathrm{real}} &= \mathrm{CE}\left(D(X_{\mathrm{spec}}, X_{\mathrm{spat}}),\, y\right) \\
L_{\mathrm{fake}} &= \mathrm{CE}\left(D(G_{\mathrm{spec}}(z_{\mathrm{spec}}, y), G_{\mathrm{spat}}(z_{\mathrm{spat}}, y)),\, y_{\mathrm{fake}}\right)
\end{aligned} \tag{3}$$

where $\mathrm{CE}(\cdot)$ denotes the cross entropy, and $f_{1}(x)$ and $f_{2}(x)$ denote the outputs of the flatten layers in the spectral and spatial branches of $D$, respectively.
The objective of the generators is to make the discriminator classify the virtual samples as one of the real classes in the dataset and to match the expected value of the features from the flatten layers, whereas $D$ aims to maximize the multi-classification accuracy on the real samples and to classify the virtual samples as class $n + 1$.
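The loss terms described above can be sketched in plain Python for a single sample. The sketch assumes that the generator loss combines the categorical and feature-matching terms as $L_c + \lambda L_s$ and that the discriminator loss is $L_{\mathrm{real}} + L_{\mathrm{fake}}$; the function names and the toy inputs are hypothetical, and the probability vectors stand in for the softmax outputs of $D$.

```python
import math

def cross_entropy(probs, target, eps=1e-12):
    """Cross entropy between predicted probabilities and a target vector."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))

def feature_matching(f_real, f_fake):
    """Squared L2 distance between real and generated flatten-layer features."""
    return sum((a - b) ** 2 for a, b in zip(f_real, f_fake))

def generator_loss(d_fake_probs, y, f1_real, f1_fake, f2_real, f2_fake, lam=0.3):
    """L_G = L_c + lam * L_s (assumed combination of the two terms)."""
    l_c = cross_entropy(d_fake_probs, y)
    l_s = feature_matching(f1_real, f1_fake) + feature_matching(f2_real, f2_fake)
    return l_c + lam * l_s

def discriminator_loss(d_real_probs, y, d_fake_probs, y_fake):
    """L_D = L_real + L_fake (assumed combination of the two terms)."""
    return cross_entropy(d_real_probs, y) + cross_entropy(d_fake_probs, y_fake)

# Toy two-class example with one-dimensional "features".
y = [1.0, 0.0]
y_fake = [0.5, 0.5]
l_g = generator_loss([0.5, 0.5], y, [1.0], [0.0], [2.0], [2.0], lam=0.3)
l_d = discriminator_loss([0.5, 0.5], y, [0.5, 0.5], y_fake)
```

In the toy example the generator loss equals $\log 2 + 0.3 \cdot 1$ and the discriminator loss equals $2 \log 2$, which makes it easy to check each term by hand.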
Besides, to alleviate the overconfidence of the discriminator, the labels in (3) can be smoothed following the strategy adopted in [52]. Concretely, by introducing a hyperparameter $\varepsilon$, the elements 0 and 1 in the vector $y$ are replaced with $\varepsilon$ and $1 - \varepsilon$, respectively.
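The smoothing rule stated above is simple enough to write down directly. The helper name is hypothetical, and the default value of 0.1 is used only as an example.

```python
def smooth_labels(y, eps=0.1):
    """Replace 1 with 1 - eps and 0 with eps in a one-hot label vector."""
    return [1.0 - eps if v == 1 else eps for v in y]

smoothed = smooth_labels([0, 1, 0], eps=0.1)   # approximately [0.1, 0.9, 0.1]
```

Note that, with this rule applied literally, the smoothed vector no longer sums exactly to one when there are more than two classes.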

4) PROCEDURE OF TBGAN
As shown in Table 1, the procedure of the TBGAN method consists of virtual sample generation, joint spectral-spatial feature extraction, and ground object classification.

III. EXPERIMENTS
To demonstrate the classification performance of the proposed TBGAN, experiments are conducted on the Pavia University, Salinas, and Indian Pines datasets. In the experiments, 10% of the labeled samples are randomly selected for training, and the remainder are used for testing. Besides, class accuracy, average accuracy (AA), overall accuracy (OA), and the Kappa coefficient are employed as indicators for measuring the classification results. Table 2 shows the sample distribution of the Pavia University dataset used for training and testing.
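The four indicators named above (class accuracy, AA, OA, and the Kappa coefficient) can all be derived from a confusion matrix using their standard definitions. The sketch below is illustrative; the function name and the small two-class matrix are hypothetical and are not results from the paper.

```python
def classification_metrics(confusion):
    """Compute class accuracy, AA, OA, and Cohen's kappa.

    confusion[i][j]: number of class-i test samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    class_acc = [confusion[i][i] / row_sums[i] for i in range(n)]
    oa = sum(confusion[i][i] for i in range(n)) / total   # overall accuracy
    aa = sum(class_acc) / n                               # average accuracy
    # Expected agreement by chance, used by the kappa coefficient.
    pe = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return class_acc, aa, oa, kappa

# Made-up two-class confusion matrix for illustration only.
class_acc, aa, oa, kappa = classification_metrics([[8, 2], [1, 9]])
```

For this toy matrix the class accuracies are 0.8 and 0.9, both OA and AA are 0.85, and kappa is 0.7. OA and AA diverge on imbalanced datasets, which is why both are reported.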

2) SALINAS DATASET
The Salinas dataset was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Salinas Valley. The imaging wavelength range of AVIRIS is from 400 to 2500 nm, in which 204 bands are available after eliminating the bands absorbed by water. This dataset has a size of 512 × 217 pixels with a spatial resolution of 3.7 m. The labeled pixels are divided into 16 categories, including Fallow, Celery, Stubble, etc. Table 3 reports the number of samples used for training and testing.

3) INDIAN PINES DATASET
The Indian Pines dataset was also collected by AVIRIS sensors, with a size of 145 × 145 pixels. After eliminating 24 bands absorbed by water, 200 spectral bands are reserved. This dataset contains 10,249 labeled pixels, which are divided into 16 categories, including Alfalfa, Corn-notill, Corn-mintill, etc. Table 4 shows the number of samples used for training and testing.

B. EXPERIMENTAL SETTING
To evaluate the performance of the proposed TBGAN model, it is compared with six representative HSI classification methods, including RBF-SVM [12], RF [9], LSTM [14], SSUN [16], M3D-DCNN [20], and DCGAN [35]. Meanwhile, exploratory experiments are additionally conducted for variants of TBGAN containing only the spectral branch or only the spatial branch, which are named TB-SPE and TB-SPA, respectively. In the comparison models, the hyper-parameters, such as gamma and C in RBF-SVM and the number of decision trees in RF, are tuned by grid search, while the parameter configurations of the other deep learning models follow their sources. The detailed configurations of TBGAN are given in Table 5 and Table 6, where Conv represents the convolutional layer, Tconv the transposed convolutional layer, Atten the self-attention layer, and BN the batch normalization. The learning rate of both the generators and the discriminator is set to 0.0002, the number of epochs is set to 500, and the batch size is 64. TBGAN adopts the Adam method [53] as the optimizer to adaptively adjust the learning rate. The dimensions of both $z_{\mathrm{spec}}$ and $z_{\mathrm{spat}}$ are set to 100, and the smoothing parameter $\varepsilon$ is empirically set to 0.1. The parameter $k$ is set to 3, indicating that the discriminator is updated three times for each update of the generators. All hyper-parameters of TB-SPE and TB-SPA are configured the same as those of TBGAN.
Besides, the experiments are carried out in PyTorch on an NVIDIA 1080Ti GPU (number of cores: 1, RAM: 11 GB, CUDA version: 11.0). Since the models may be influenced by random initialization, the mean and standard deviation of the classification results over ten runs are reported.
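The update schedule controlled by the parameter k can be sketched as a small training skeleton. The two callbacks stand in for the actual optimizer steps and are hypothetical, as is the choice of one generator update per iteration as the unit of the loop.

```python
def train(num_iterations, update_discriminator, update_generators, k=3):
    """Alternate k discriminator updates with one update of the generators."""
    for _ in range(num_iterations):
        for _ in range(k):
            update_discriminator()   # generators are held fixed here
        update_generators()          # discriminator is held fixed here

# Count the updates with placeholder callbacks instead of real optimizer steps.
counts = {"D": 0, "G": 0}

def d_step():
    counts["D"] += 1

def g_step():
    counts["G"] += 1

train(5, d_step, g_step, k=3)
```

After five iterations with k = 3 the discriminator has been updated 15 times and the generators 5 times, which also explains part of the extra training cost of GAN-based models.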

C. CLASSIFICATION RESULTS
Tables 7-9 present the classification results of the nine methods on the Pavia University, Salinas, and Indian Pines datasets, respectively. Each table records, from top to bottom, the means of class accuracy, AA, OA, and Kappa coefficient over ten runs, as well as the standard deviations of the latter three evaluation metrics. As can be seen from Tables 7-9, the deep learning methods generally achieve better classification performance than traditional machine learning methods by exploiting hierarchical features. Furthermore, the GAN-based methods can generate more training samples, which benefits network training and allows them to achieve higher accuracy than the other deep learning methods. Among these GAN-based methods, TBGAN exceeds TB-SPE and TB-SPA, which demonstrates the advantage of utilizing the joint spectral-spatial features. DCGAN and TB-SPA achieve encouraging results, which can be attributed to the use of the PCA transformation, which partly introduces spectral information. By virtue of the multiscale connections and the two-branch structure, TBGAN obtains the best classification results among the nine methods. For the Pavia University dataset, TBGAN improves the OA by 5.67%, 1.78%, and 0.13% compared with LSTM, M3D-DCNN, and SSUN, respectively. For the Salinas dataset, TBGAN attains the best class accuracy for 12 classes, 7 of which reach 100%, and the OA averaged over ten runs reaches 99.98%. In addition, TBGAN also achieves superior performance on the imbalanced Indian Pines dataset. For example, a prediction accuracy of 96.80% is obtained for the Grass-pasture-mowed class with only 3 training samples.
In addition to the quantitative comparisons of the classification results in Tables 7-9, a qualitative visualization is provided by creating the classification maps of each method on the three HSI datasets. As exhibited in Figs. 6-8, the classification maps obtained by TBGAN are closer to the ground truths and have fewer outliers than those of the other methods, which further confirms the effectiveness of the proposed method. Moreover, because the input of TB-SPA is a 3D cube of size 47 × 47 × 3, its prediction may be disturbed by the substantial spatial homogeneity, so that the classification results tend to be over-smoothed. Owing to the designed two-branch structure, TBGAN can extract the spectral information more thoroughly, which makes the classification results more refined than those of TB-SPA.

D. MODEL COMPLEXITY
To assess the complexity of the proposed TBGAN, Table 10 presents the number of parameters (Params) and floating-point operations (FLOPs) of seven deep learning methods. The results suggest that TBGAN has fewer parameters than advanced SSUN and DCGAN, but the actual computation is slower than that of other models due to the two-branch structure.
For a more comprehensive evaluation, the running time of the nine methods on each dataset is provided in Table 11. Generally speaking, shallow machine learning models are more efficient than deep learning algorithms. More significantly, the four GAN-based models take more time during the training stage than the other deep learning models because both the generator and the discriminator need to be trained. In particular, the proposed TBGAN and its sub-models TB-SPE and TB-SPA all adopt the training strategy of updating the discriminator three times for each update of the generator. TBGAN requires a longer training time than TB-SPE and TB-SPA, probably because it needs to extract the joint spectral-spatial features and update the two generators in each training iteration.

IV. DISCUSSION
The relevant experiments are carried out for exploring the impacts of some significant influencing factors such as patch size, hyper-parameter λ, self-attention mechanism, and the number of training samples upon the model of TBGAN.

A. IMPACTS OF THE PATCH SIZES
The performance of TBGAN is susceptible to the patch size. Larger patches may contain redundant information, resulting in lower classification accuracy and heavier computation. In contrast, smaller patches may provide insufficient spatial features for training the model, leading to misclassification. In the experiments, four spatial neighborhoods of 31 × 31, 39 × 39, 47 × 47, and 55 × 55 are adopted, and the classification results on the three datasets are presented in Table 12. For the Salinas and Indian Pines datasets, TBGAN achieves the best classification results with a patch size of 47 × 47, while for the Pavia University dataset, the best OA of TBGAN corresponds to a patch size of 55 × 55. Thus, the patch size in the formal experiments is set to 47 × 47, the best choice for the majority of the datasets.

B. OPTIMAL CHOICE OF HYPER-PARAMETER λ IN $L_{G}$
The hyper-parameter $\lambda$ in (2) is a weight factor that trades off $L_{c}$ and $L_{s}$. To explore the influence of $\lambda$ on the classification results, the value of $\lambda$ is selected from the range [0, 0.5] at intervals of 0.1. As shown in Table 13, for the first two datasets, the overall accuracy of TBGAN is little affected by $\lambda$. However, for the Indian Pines dataset, the performance of TBGAN varies significantly with the value of $\lambda$, and the best overall accuracy is obtained when $\lambda$ equals 0.3. In view of this, the value of $\lambda$ is set to 0.3 in the experiments.

C. ADVANTAGES OF SELF-ATTENTION MECHANISM
To capture the long-term dependencies in the spectral sequences, $G_{\mathrm{spec}}$ is integrated with the self-attention mechanism, whose effectiveness is validated by training the TBGAN model with and without self-attention. As shown in Table 14, the classification performance of TBGAN is significantly improved on the three datasets with the addition of the self-attention mechanism.

D. SENSITIVITY TO THE NUMBER OF TRAINING SAMPLES
To investigate the sensitivity of different classification methods to the number of training samples, 10%, 9%, 8%, 7%, and 6% of the labeled samples are successively selected from the three datasets in the experiments. As shown in Fig. 9, as the number of training samples is reduced, the classification accuracy of all nine methods declines to varying degrees. It is well known that deep learning methods require extensive training samples to optimize the parameters, and insufficient samples tend to result in overfitting of the model, thus reducing the classification accuracy. In contrast, the four GAN-based models, by virtue of generating real-like samples, can alleviate the overfitting problem caused by the reduction of training samples. Specifically, as the ratio of training samples from the three datasets decreases from 10% to 6%, the OA of TBGAN declines by 0.2%, 0.09%, and 2.3%, respectively, which is a markedly smaller decline than that of the other methods.

V. CONCLUSION
This article proposes a novel TBGAN model for HSI classification. Specifically, two generators are devised in TBGAN to produce real-like spectral and spatial data, respectively, which alleviates the small-sample-size problem. Furthermore, the spectral generator is integrated with the self-attention mechanism, improving its ability to model long-term dependencies. For the multi-classification task, a discriminator with two branches is designed in TBGAN to extract the spectral and spatial features more thoroughly. In particular, multiscale connections are placed between the discriminator and the two generators in TBGAN to improve the network stability and the classification capability. Meanwhile, a feature-matching term is added to the loss function to make the training process more stable. The experimental results demonstrate that TBGAN achieves superior classification performance and is less sensitive to the number of training samples, which shows great potential for classification when only a small number of samples is available. In future research, more innovative strategies are expected to be developed in GAN-based supervised frameworks to further improve the performance of HSI classification.