ResBaGAN: A Residual Balancing GAN with Data Augmentation for Forest Mapping

Although deep learning techniques are known to achieve outstanding classification accuracies, remote sensing datasets often present limited labeled data and class imbalances, two challenges to attaining high levels of accuracy. In recent years, the GAN architecture has achieved great success as a data augmentation method, driving research toward further enhancements. This work presents ResBaGAN, a GAN-based method for the classification of remote sensing images, designed to overcome the challenges of data scarcity and class imbalances by constructing an advanced data augmentation framework. This framework builds upon a GAN architecture enhanced with an autoencoder initialization and class balancing properties, a superpixel-based sample extraction procedure with traditional augmentation techniques, and an improved residual network as classifier. Experiments were conducted on large, very high-resolution multispectral images of riparian forests in Galicia, Spain, with limited training data and strong class imbalances, comparing ResBaGAN to other machine learning methods such as simpler GANs. ResBaGAN achieved higher overall classification accuracies, particularly improving the accuracy of minority classes with F1-score enhancements reaching up to 22%.


I. INTRODUCTION
Remote sensing is essential for monitoring the Earth's surface, aiding in tasks such as tracking human-made constructions [1], detecting cropland changes [2], and studying ecosystems [3]. Accurate image classification methods are crucial for many of these applications, such as identifying areas invaded by non-native plant species for environmental monitoring. Unfortunately, remote sensing datasets often present limited labeled data and class imbalances (i.e., certain classes have significantly more samples than others), making it challenging to develop accurate classification methods [4]. These challenges become even more pronounced when processing data from multispectral sensors, given their limited spectral resolution [5]. It is crucial to develop advanced methods that optimally use the available data to overcome these challenges.
Neural networks have become increasingly prevalent in remote sensing applications owing to their exceptional performance. For instance, Zhu et al. [6] and Yuan et al. [7] reviewed the application of different deep learning techniques in agriculture and environmental monitoring. The CNN stands out for its particularly well-suited image processing capabilities, which has led to its widespread application. For example, in Morales et al. [8], DeeplabV3+ CNNs were used to monitor the deforestation of the Mauritia flexuosa palm, a dominant species in the Amazon rainforest ecosystem, through high-resolution aerial RGB images acquired by UAV. In Hamdi et al. [9], a modified U-Net CNN was implemented to automatically detect and map damaged areas in a forest in Bavaria, Germany, by using aerial photographs with four spectral bands. The U-Net architecture was also used in Isaienkov et al. [10] for deforestation detection in the forest-steppe zone of the Kharkiv region of Ukraine, by using Sentinel-2 multispectral images. Finally, a region-based Mask CNN was used in Chiang et al. [11] for automated forest health diagnosis in RGB aerial images from the Wood of Cree in Scotland, aiding in the early detection of dead trees.
Over the past few years, remote sensing research has developed more sophisticated deep learning architectures to better utilize the available training data. The CNN design has progressed from shallow architectures with few layers [12] to deeper, more powerful ones with tens or hundreds of layers, commonly known as residual networks [13], [14], [15], [16], which can be recognized by their innovative integration of skip-connections between layers. Recent advances have also introduced new components to CNNs. For example, Li et al. [17] included attention mechanisms into a simple CNN to extract richer spatial and spectral information. Another example is Hong et al. [18], which combined a CNN with a GCN to model relationships between training samples, enhancing the classification performance. Researchers have also explored deep learning architectures beyond CNNs. For instance, He et al. [19] leveraged the transformer architecture [20] to assimilate high volumes of data, advancing the state-of-the-art in the multimodal semantic segmentation of images. Another promising research direction is using multimodal data from various sensors for complex scene analysis. For instance, Hong et al. [21], [22] studied the use of dual-branch CNN architectures to combine features from two different modalities (e.g., hyperspectral and LiDAR). Furthermore, designing self-supervised architectures that can learn from both unlabeled and labeled data is another promising area. For instance, Sun et al. [23] developed a general-purpose model by analyzing millions of unlabeled scenes with a transformer-backed autoencoder architecture, acquiring extensive remote sensing knowledge that can be fine-tuned for a specific task, outperforming various state-of-the-art architectures.
Data augmentation provides another promising solution to address data scarcity and class imbalance constraints. These techniques can assist advanced deep learning classification architectures by synthetically generating new training samples to enrich the learning data. Numerous data augmentation techniques exist [24] and have been applied to remote sensing. The traditional approaches involve transformations that convert a sample into a new one and techniques that combine different samples of the same class to create new ones. For instance, Haut et al. [25] showed the effectiveness of augmentation through the random deletion of input patch segments, while Acción et al. [26] subdivided each patch and applied independent transformations to each segment. In Nalepa et al. [27], new samples were generated from the first principal component of the dataset or by calculating the mean value of each band. More novel augmentation approaches use generative techniques that synthesize samples from scratch after estimating the data distribution, as opposed to transforming existing samples into new ones.
Currently, the most widely adopted data generative approach is the GAN [28]. This deep learning-based augmentation approach has shown great potential compared to traditional techniques [29], owing to its remarkable capacity to generate highly realistic and diverse data from scratch, outperforming earlier generative architectures, such as the RBM [30] and the VAE [31]. Further advancements in the GAN architecture have also enabled its use as a classifier network, establishing it as a valuable tool for remote sensing applications like environmental monitoring. For instance, in Shashank et al. [32], a GAN was used to identify the target epiphyte Werauhia kupperiana in RGB images acquired by UAV in Costa Rican forests. Other interesting scenarios of GANs for environmental monitoring are described in [6] and [33]. Unfortunately, GANs face two main obstacles. First, successfully training these networks is challenging unless large datasets are available [34]. Second, class imbalances hinder the learning of minority classes, even preventing it entirely. In such cases, GANs tend to synthesize identical samples for these classes or even fail to capture their data distribution, resulting in pure noise [35]. However, a recent development in GANs, the BAGAN [35], provides a promising solution to enable the successful application of GANs when facing these two challenges.
This work presents ResBaGAN, a novel GAN-based classification method for remote sensing images applied to environmental monitoring. The ResBaGAN architecture overcomes the challenges of data scarcity and class imbalances by constructing an advanced data augmentation framework to support the classifier. The framework builds upon the cutting-edge innovations of BAGAN to provide effective GAN-based data augmentation. In addition, the framework leverages a superpixel segmentation-based sample extraction process with traditional augmentation techniques, and a ResNet-based classifier [13] improved to further enhance its capabilities. Experiments were conducted to demonstrate how the combined synergistic effects of all components significantly improved classification performance, particularly in the minority classes, compared to simpler methods like a stand-alone BAGAN. More specifically:

1) The proposed method integrates a GAN architecture inspired by BAGAN to enable effective deep learning-based data augmentation for scarce datasets and their minority classes.

2) The sample extraction process is guided by a superpixel segmentation and includes traditional augmentation techniques, thereby facilitating the more accurate modeling of the different classes. The segmentation ensures that each sample predominantly contains pixels from a single class, while the additional augmentation techniques increase the diversity of the learning data.

3) The classifier features a ResNet-based design improved to integrate data from different levels of abstraction for better generalization. This architecture enhances both the quality of the GAN-based augmentation and the final classification accuracy of ResBaGAN.

4) In terms of evaluation, the FID score [36], which is the current standard metric for GAN-based architectures on RGB data, was adapted to multispectral images. This modification allows for an accurate assessment of the proposed deep learning-based data augmentation.

The rest of this article is organized as follows. Section II outlines the GAN architecture and its evolutions. Then, Section III describes ResBaGAN, detailing its different components. Thereafter, Section IV presents the experiments evaluating ResBaGAN in terms of classification performance and robustness. Next, Section V discusses this work. Finally, Section VI concludes this article.

II. RELATED WORK
A GAN [28] is a generative deep learning architecture that consists of the two networks represented in Fig. 1(a): a generator G and a discriminator D. These networks are trained in an adversarial fashion: G learns to synthesize samples as realistically as possible to deceive D into believing that they are real, while D is trained to distinguish between real and fake samples. As a result of this opposition of objectives, the improvement of one network encourages the other to perform better.
Both D and G use CNN architectures. D is a standard CNN, while G replaces conventional convolutions with transposed convolutions, which function inversely. D needs to transform a sample into a unidimensional feature vector of length Z for the final classification. In contrast, G aims to generate data akin to the input of D. To achieve this, it starts with a random vector of Z elements, often referred to as a latent vector, reshaped into a low-resolution sample. The size of this sample is gradually increased by applying transposed convolutions to produce a fake sample.
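To make this duality between D and G concrete, the following minimal PyTorch sketch pairs a small discriminator with its mirrored generator. The layer sizes and filter counts are illustrative only and do not correspond to the actual ResBaGAN topologies detailed later; the five input bands and the 32 × 32 patch size match the datasets used in this work.

```python
# Minimal sketch (PyTorch) of the D/G duality described above; layer sizes
# are illustrative, not the actual ResBaGAN topologies.
import torch
import torch.nn as nn

Z = 128  # latent vector length

discriminator = nn.Sequential(                       # 32x32x5 patch -> Z features
    nn.Conv2d(5, 64, 4, stride=2, padding=1),        # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),      # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, Z, 8),                            # 8x8 -> 1x1 (Z features)
    nn.Flatten(),
)

generator = nn.Sequential(                           # Z-element latent -> 32x32x5 patch
    nn.Unflatten(1, (Z, 1, 1)),                      # latent vector as a 1x1 "patch"
    nn.ConvTranspose2d(Z, 128, 8),                   # 1x1 -> 8x8
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 8x8 -> 16x16
    nn.ReLU(),
    nn.ConvTranspose2d(64, 5, 4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),                                       # outputs in [-1, 1]
)

z = torch.randn(1, Z)
fake = generator(z)                                  # shape: (1, 5, 32, 32)
features = discriminator(fake)                       # shape: (1, Z)
```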
Building upon the initial GAN architecture in Fig. 1(a), numerous optimizations have been proposed in recent years to further extend its capabilities [37]. The following are particularly relevant to this work:

1) CGAN [38]: In the original GAN, G cannot synthesize samples for a specific class on demand. The CGAN architecture, shown in Fig. 1(b), addressed this limitation by incorporating information corresponding to the desired class in the latent vector.

2) ACGAN [39]: A natural extension of CGAN is to enable D to assign each sample to the best-fitting class, in addition to discerning between real and fake samples. To this end, D in ACGAN has two outputs, as shown in Fig. 1(c): a binary output for detecting fake samples and an output with as many elements as classes.

3) BAGAN [35]: This architecture introduced several modifications to stabilize training with small datasets and effectively synthesize the minority classes. First, the networks' parameters are initialized by an autoencoder that processes the dataset in advance, instead of from scratch. As a result, both D and G start learning from an approximation of the data distribution, significantly stabilizing their training on small datasets. In addition, a more sophisticated sampling strategy for the latent vectors is introduced by modeling class-specific distributions in the latent space provided by the autoencoder, which further stabilizes the training of G (sketched below). Moreover, the dual outputs of D are combined into a single output, as shown in Fig. 1(d). By extension, the loss function used in the training process becomes simpler, making it easier for G to learn to synthesize the minority classes.

As discussed in Section I, the evolving capabilities of GANs have established them as valuable tools for data augmentation in remote sensing applications. In particular, BAGAN has shown promise in different computer vision tasks to address the challenges of scarce and imbalanced datasets, two common limitations in remote sensing. However, to the best of our knowledge, the use of BAGAN in remote sensing has been limited to object detection, as demonstrated in Zhang et al. [40].
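As an illustration of BAGAN's class-conditioned latent sampling mentioned in item 3, the following hedged sketch fits one Gaussian per class over the latent codes produced by a trained encoder E, then draws latent vectors from the distribution of the requested class. A diagonal covariance is assumed here for simplicity; the original formulation may model richer class-specific distributions.

```python
import torch

# Hedged sketch of BAGAN-style class-conditional latent sampling: fit one
# Gaussian per class over the trained encoder's latent codes, then draw
# latent vectors for G from the distribution of the requested class.
# `E` and the per-class patch tensors are assumed to be available.
def fit_class_latents(E, samples_per_class):
    stats = {}
    with torch.no_grad():
        for c, x in samples_per_class.items():   # c: class id, x: (n, B, H, W)
            z = E(x)                             # latent codes, shape (n, Z)
            stats[c] = (z.mean(dim=0), z.std(dim=0) + 1e-6)
    return stats

def sample_latent(stats, c, n):
    mu, sigma = stats[c]
    eps = torch.randn(n, mu.numel())
    return mu + sigma * eps                      # n class-conditioned latents
```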
The method proposed in this article, ResBaGAN, is a GAN-based classification approach inspired by innovations in BAGAN, among other features. We not only aim to expand its usage in remote sensing, particularly for classification applied to environmental monitoring, but also enhance its performance further. This is achieved by incorporating traditional augmentation techniques, superpixel segmentation-based sample extraction, and an improved ResNet-based classifier.

III. PROPOSED METHOD
ResBaGAN, the proposed method for remote sensing classification applied to environmental monitoring, is illustrated in Figs. 2 and 3. This section describes its different components as follows. First, the GAN architecture designed for deep learning-based data augmentation is discussed in Section III-A. Next, the sample extraction procedure that combines superpixel segmentation and traditional augmentation techniques is introduced in Section III-B, followed by a complete step-by-step summary of how ResBaGAN operates. Finally, the design of the neural architectures in ResBaGAN is detailed in Section III-C.

A. Network Architecture
The core of ResBaGAN is a BAGAN-based architecture that includes an improved ResNet-based classifier [13], resulting in a combination that enhances both the quality of the GAN-based augmentation and the final classification accuracy. The integration of shortcuts in residual networks allows building much deeper architectures compared to stacking convolutional layers alone, circumventing training issues like gradient vanishing. This makes ResNet more suitable than shallow CNNs for classification with limited data, as it facilitates the extraction of more detailed features from the available samples. Moreover, the generalization capabilities of the designed ResNet architecture have been improved by fusing features from every convolutional stage into the final classification layers, instead of using features just from the last stage. Furthermore, the designed GAN module leverages two BAGAN features to assist the generator network in learning: autoencoder initialization and an improved loss function. This approach leads to the synthesis of more realistic samples when dealing with scarce and imbalanced datasets, thus providing an effective data augmentation for such situations.
As shown in Fig. 2, three main elements can be identified in the network architecture of ResBaGAN:

1) The input data are patches from the different classes, as displayed on the left side of the figure. The specialized sample extraction procedure provides this augmented training data and will be detailed in Section III-B.

2) ResBaGAN leverages an autoencoder module [41] to stabilize the subsequent training of the GAN on limited data. The autoencoder is depicted in the upper part of the figure.
Autoencoders are easier to train than GANs under data scarcity, but they do not assign classes to the acquired knowledge, making them unsuitable for classification. However, they can help in initializing the weights of the GAN.
The autoencoder must learn to compress and reconstruct samples as accurately as possible, encouraging it to acquire an initial understanding of the dataset. The autoencoder consists of an encoder (E in the figure) and a decoder (Δ). As the encoding and decoding steps resemble the tasks of the discriminator (D) and the generator (G), respectively, weight-sharing by using the same network topologies allows transferring the knowledge of the trained autoencoder to the uninitialized GAN, so that the two modules complement each other's strengths and weaknesses.
3) The main networks of ResBaGAN are D and G, collectively referred to as the GAN module in the figure.
Once initialized from the autoencoder, these networks are trained in an adversarial manner, so that D learns on both the real samples and the samples that G synthesizes.
To facilitate G's learning of the minority classes, D combines its real/fake prediction and class prediction in a single output. This output contains N + 1 elements, where N denotes the number of possible classes, and samples suspected to be fake are assigned the additional class. The specific network topologies of the D and G networks (and thus E and Δ) in ResBaGAN will be detailed in Section III-C.

The ResBaGAN training process therefore involves three steps:

1) Learning of the autoencoder module: The autoencoder trains on all of the available samples without considering their labels, to produce the initial understanding of the data distribution. The L2 loss, shown as the dotted line at the top of the autoencoder, guides the training process. This loss is calculated as the mean squared difference between the input and the reconstructed samples, which the autoencoder aims to minimize. Given a reconstructed sample $\hat{y}$ with $n$ elements and its target $y$, the L2 loss is as follows:

$$L_2(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 .$$

2) GAN initialization: All of the learned parameters are transferred from the autoencoder to the GAN networks, enabling them to acquire the knowledge of the autoencoder. This is shown in the rightmost line of the figure.

3) Learning of the GAN module: Owing to the use of the transferred weights as starting values, the training starts from a more stable point than randomly initialized parameters, as the GAN refines the initial understanding of the data. The categorical cross-entropy loss, represented by the dotted line at the bottom of Fig. 2, guides this learning process. This loss is calculated as the difference between the predicted and the expected class-probability distributions for a sample. Given a predicted distribution $\hat{y}$, obtained by applying the softmax function to D's raw output, and a one-hot encoded class target $y$, the cross-entropy loss is as follows:

$$L_{CE}(\hat{y}, y) = -\sum_{c=1}^{N+1} y_c \log \hat{y}_c$$

where $y_c$ denotes the target probability of the sample belonging to class $c$.

Both D and G strive to minimize their respective losses, $L_D$ and $L_G$, which are based on the cross-entropy loss. As D wants to accurately classify the real samples and assign the fake label to the samples from G, its loss can be split into a real part and a fake part:

$$L_D = L_D^{\text{real}} + L_D^{\text{fake}} .$$

The real part is obtained by applying the cross-entropy loss between the predicted class probabilities of the real samples and their reference labels, and the fake part is obtained by applying the cross-entropy loss between the predicted probabilities of synthetic samples and the fake label. In contrast, G's objective is to ensure that D assigns its generated fake samples to the class that they intend to represent. To do so, the cross-entropy loss is applied between the predicted probabilities of the synthetic samples and their intended labels. Training proceeds by alternately optimizing the discriminator and the generator, which can be summarized in a minimax manner as follows:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D_y(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log D_{N+1}(G(z, y))\right] .$$

The first term corresponds to the probability of all real samples $x$, drawn from the $p_{\text{data}}(x)$ data distribution, being correctly classified according to their labels $y$. The second term represents the probability of all synthetic samples being classified as the fake class $N + 1$; $z$ is a latent vector drawn from the latent space distribution $p_z(z)$, which is conditioned by the intended class $y$ in $G(z, y)$ to produce a synthetic sample. Once all of the neural networks in ResBaGAN have been trained, the fake output of D is disabled, yielding the final classifier.
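The training losses above can be condensed into the following PyTorch sketch. It assumes that D outputs N + 1 raw logits with the last index reserved for the fake class, and that `cross_entropy` internally applies the softmax, matching the description above; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the ResBaGAN training losses described above; D is assumed to
# output N + 1 raw logits, with index N reserved for the fake class.
N = 10                                        # number of real classes
FAKE = N                                      # index of the extra fake class

def autoencoder_loss(x_rec, x):
    return F.mse_loss(x_rec, x)               # L2 reconstruction loss

def d_loss(D, G, x_real, y_real, z, y_fake):
    x_fake = G(z, y_fake).detach()            # do not backprop into G here
    loss_real = F.cross_entropy(D(x_real), y_real)          # real part of L_D
    loss_fake = F.cross_entropy(D(x_fake),                  # fake part of L_D
                                torch.full_like(y_fake, FAKE))
    return loss_real + loss_fake

def g_loss(D, G, z, y_intended):
    x_fake = G(z, y_intended)
    # G wants D to assign each synthetic sample to its intended class
    return F.cross_entropy(D(x_fake), y_intended)
```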

B. Sample Extraction via Superpixel Segmentation and Traditional Augmentation
To further improve ResBaGAN in handling scarce and imbalanced datasets, a specialized procedure for extracting samples from the dataset is introduced. This procedure leverages two extensively studied techniques in remote sensing: superpixel segmentation and traditional data augmentation. Fig. 3 illustrates this procedure step by step, yielding the training samples fed to the networks described in Section III-A.
A superpixel segmentation groups similar pixels in an image into homogeneous and contiguous regions called superpixels. These regions have a relatively constant size but no specific shape, as they adapt to the objects in the scene [42]. Thus, superpixels can be used as larger pixels for image processing and have been widely used in remote sensing classification [3], [26], [43], [44]. Instead of extracting a patch for each pixel in a sliding-window fashion [45], one patch per superpixel can be extracted, assigning the predicted class to all of its pixels. If the patch size is adjusted to fit within the average size of the superpixels, patches centered on the superpixels predominantly contain pixels of a single class. This approach helps avoid the noise introduced in the learning data by patches containing multiple classes, particularly at object edges, thus facilitating the modeling of the different classes for the neural networks.
On this basis, ResBaGAN extracts samples from the dataset by using a superpixel segmentation as guidance, as shown in Fig.  3 and detailed in Algorithm 1. Moreover, traditional data augmentation techniques are applied to all of the training samples, as described in Algorithm 2.
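A hedged sketch of this extraction procedure is given below; it assumes the image and its superpixel map are available as NumPy arrays, and it illustrates only simple flips and rotations as the traditional augmentations, which may differ from the exact operations in Algorithms 1 and 2.

```python
import numpy as np

# Hedged sketch of superpixel-guided extraction: one 32x32 patch centered
# on each superpixel, followed by simple traditional augmentations.
# `image` is (H, W, B); `segments` assigns a superpixel id to every pixel.
def extract_patches(image, segments, size=32):
    half, patches = size // 2, []
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    for sp in np.unique(segments):
        rows, cols = np.nonzero(segments == sp)
        cy, cx = int(rows.mean()), int(cols.mean())     # superpixel centroid
        patches.append(padded[cy:cy + size, cx:cx + size, :])
    return np.stack(patches)

def augment(patch, rng=np.random.default_rng()):
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]                       # horizontal flip
    return np.rot90(patch, k=int(rng.integers(4)))      # random 90-degree rotation
```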
Regarding the superpixel segmentation algorithm, numerous options exist in the literature, such as SLIC [46], ETPS [47], and  [48]. Typically, these algorithms are based on gradient descent or graphs [42], with the latter being more computationally expensive and lacking control of the segment size or regularity [46]. WP [49] is a popular gradient descent algorithm that provides good-quality segmentations with moderate computational consumption [42], [46], and allows for the customization of the segment size and regularity. Therefore, WP is chosen as the superpixel segmentation algorithm for ResBaGAN.
With all of the ResBaGAN components now explained, its step-by-step operation can be summarized in Algorithm 3.

C. Network Topologies
As explained in Section III-A, the autoencoder and the GAN modules share network topologies. First, the topology of the residual discriminator (D in Fig. 2) will be discussed, which is also applicable to the encoder (E). Then, the design of the generator (G) will follow, which is also applicable to the decoder (Δ).
The ResBaGAN classifier (D) features a ResNet-based topology, with the novelty of integrating data from different levels of abstraction to enhance generalization, as detailed in the following paragraph. This architecture is depicted in Fig. 4. Compared to shallow CNNs, a residual approach allows for more efficient use of the available learning data and, owing to the adversarial nature of GANs, pushes the capabilities of G further.
More specifically, the classifier consists of three residual stages (colored differently in the figure), each containing three residual blocks that share the same number of convolutional filters; the filter count grows from stage to stage, allowing for the extraction of increasingly complex visual features. Additional convolutional layers precede each stage to ensure appropriate input dimensionality, and the first convolutional layer in each stage has a stride of 2 for learned dimensionality reduction. In contrast to the original ResNet architecture, this improved design fuses the features extracted from the last stage with those derived from all preceding stages, by adding the corresponding feature maps. This approach merges data from the low-, mid-, and high-level abstractions into the input for the final layers, leading to better generalization. Two additional convolutional layers map the dimensionality of the outputs from Stage 1 and Stage 2 to Stage 3 to perform the addition. A global average-pooling layer transforms the fused feature maps into a feature vector, which a fully connected layer processes for the final classification.
The details of the classifier's layers are presented in Table I. For brevity, the table omits the convolutional layers adapting the input dimensionality between stages and those adapting the output from the first two stages for feature fusion; they have 3 × 3 filters and use the same activation function as the other layers. As usual in residual networks, a dropout layer follows each convolutional layer to stabilize training and address problems such as overfitting [50]. The selected dropout probability and activation function will be explained in Section IV-A4. Spectral normalization [51] is also applied to all convolutional layers, as it commonly improves GAN networks.
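The feature-fusion scheme can be summarized in the following condensed PyTorch sketch. It collapses each residual stage into a single convolution for brevity, so the filter counts and projection layers are illustrative rather than the exact Table I topology; only the fusion-by-addition idea is faithful to the design described above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Condensed sketch of the multi-stage feature fusion (not the full Table I
# topology): features from every stage are projected to the last stage's
# shape, added together, pooled, and classified.
def conv(cin, cout, stride=1):
    return spectral_norm(nn.Conv2d(cin, cout, 3, stride=stride, padding=1))

class FusionHead(nn.Module):
    def __init__(self, n_classes=11):                  # N classes + fake class
        super().__init__()
        self.stage1 = nn.Sequential(conv(5, 64, 2), nn.LeakyReLU(0.2))     # 32 -> 16
        self.stage2 = nn.Sequential(conv(64, 128, 2), nn.LeakyReLU(0.2))   # 16 -> 8
        self.stage3 = nn.Sequential(conv(128, 256, 2), nn.LeakyReLU(0.2))  # 8 -> 4
        # project earlier stages to stage 3's resolution and depth
        self.proj1 = spectral_norm(nn.Conv2d(64, 256, 3, stride=4, padding=1))
        self.proj2 = spectral_norm(nn.Conv2d(128, 256, 3, stride=2, padding=1))
        self.fc = nn.Linear(256, n_classes)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        fused = f3 + self.proj1(f1) + self.proj2(f2)   # low/mid/high fusion
        pooled = fused.mean(dim=(2, 3))                # global average pooling
        return self.fc(pooled)
```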
In the design of ResBaGAN, achieving an equilibrium between G and D is crucial for stable learning and thus sustained adversarial behavior. This balance enables both networks to progress together; if one significantly outperformed the other, the adversarial learning would be disrupted and the networks could no longer compete, preventing G from learning to perform a deep learning-based augmentation that is useful to D. A standard GAN generator has thus been found to be an ideal counterpart for the residual classifier in ResBaGAN.
As depicted in Fig. 5, the generator consists of a series of transposed convolutional layers. The initial embedding layer [52] transforms the desired class into a Z-element vector, combined with the latent vector via a dot product operation. The result, interpreted as a patch with the dimensions N × N × B = 1 × 1 × Z, is supplied to the transposed convolutions to generate the synthetic sample by gradually increasing its resolution.
The details of the generator's layers are presented in Table II. Notably, the choice of Z can significantly impact the synthesis performance. Short latent vectors may struggle to create realistic samples, whereas large vectors could be too difficult to handle effectively. Hence, the impact of Z will be examined in Section IV-A4. The activation function also follows the rationale in Section IV-A4. The only exception is the activation of the last layer, which uses a hyperbolic tangent function, as all of the data are scaled to the [−1, 1] range before being input to ResBaGAN. Spectral normalization is also applied to all of the convolutional layers.
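A compact sketch of this conditional generator, under the same illustrative assumptions as before, could look as follows; the class embedding modulates the latent vector element-wise, which is one common realization of the combination step described above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Sketch of the class-conditioned generator: an embedding maps the class to
# a Z-element vector that modulates the latent vector; transposed
# convolutions then grow the 1x1xZ "patch" to 32x32xB. Sizes are illustrative.
Z, N_CLASSES, BANDS = 128, 10, 5

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_CLASSES, Z)
        self.net = nn.Sequential(
            spectral_norm(nn.ConvTranspose2d(Z, 256, 4)),                         # 1 -> 4
            nn.LeakyReLU(0.2),
            spectral_norm(nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)),  # 4 -> 8
            nn.LeakyReLU(0.2),
            spectral_norm(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)),   # 8 -> 16
            nn.LeakyReLU(0.2),
            spectral_norm(nn.ConvTranspose2d(64, BANDS, 4, stride=2, padding=1)), # 16 -> 32
            nn.Tanh(),                     # outputs in [-1, 1], like the inputs
        )

    def forward(self, z, y):
        h = z * self.embed(y)              # class embedding modulates the latent
        return self.net(h.view(-1, Z, 1, 1))
```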

IV. EXPERIMENTS
In this section, ResBaGAN is evaluated in terms of classification performance to assess both its general effectiveness and the contributions of its individual components, such as the synthesis quality of the GAN-based augmentation. ResBaGAN's capabilities are compared with simpler classification approaches, such as standalone CNNs and other GAN-based approaches like ACGAN.
The section is organized as follows. First, the datasets, metrics, and experimental environment used for the assessment are outlined in Section IV-A. A range of design choices that were left open in Section III are also addressed, such as the selection of activation functions, the optimizer, and the number of training epochs, among other factors. Then, the experimental results are presented and analyzed in Section IV-B.
A. Experimental Setup

1) Datasets: Eight large, very high-resolution multispectral images of natural regions with dense vegetation were used [3]. These images were captured in 2018, 2019, and 2020, by flying a UAV at a 120 m altitude over various river basins in Galicia, Spain, resulting in a spatial resolution of 10 cm/px. The UAV carried a MicaSense RedEdge-MX multispectral camera, capturing five spectral bands corresponding to wavelengths of 475 nm (blue), 560 nm (green), 668 nm (red), 717 nm (red edge), and 842 nm (near-infrared). Table III details the specific locations and dimensions of the scenes.
The composite color image and the reference data for each dataset can be found in Fig. 6. Table IV enumerates the ten identifiable classes in the reference data, detailing the number of samples in each dataset. These classes range from native vegetation to human-made structures such as roads or buildings. It is important to highlight the strong imbalances between classes in all datasets, with minority classes having multiple orders of magnitude fewer samples at times. This imbalance introduces a bias toward majority classes, potentially preventing a balanced classification accuracy across the different classes.
All of the datasets were segmented using the WP algorithm by choosing an average size of 400 px/superpixel, allowing a minimum size of 100 px/superpixel, and employing a compactness factor of 0.5 points, following the approach in [3]. The extracted patches had spatial dimensions of N × N = 32 × 32 px. In addition, all the data were normalized to the [−1, 1] range. For a given dataset, all associated experiments used the same subset of 15% training samples, and 5% validation samples to monitor the training progress by identifying potential issues like overfitting. This limited availability of learning data is enforced to make training difficult for the neural networks, as allocating 15% of the samples for training often yields fewer than 100 reference samples for many classes, particularly impacting the minority ones.

Fig. 6. Composite color images (left) and reference data (right) for the datasets used in this work [3]. All representations follow the same size scale. The class corresponding to each color in the reference data is described in Table IV; black means no reference data are available.
2) Metrics: The classification performance of ResBaGAN was determined by predicting the class of each available labeled sample in the dataset and comparing the results to the reference data. The following standard pixel-level metrics in remote sensing classification [53] were computed, excluding only the central pixels from the training samples as described in [3]: OA, AA, and Cohen's Kappa (κ).
To understand the performance of ResBaGAN across majority and minority classes, the classification performance was further examined with the following classwise metrics [53]: PA, UA, and F1. PA indicates the probability that a pixel belonging to a specific class was correctly classified, whereas UA denotes the probability that a pixel classified as a certain class indeed belonged to that class. F1 is the harmonic mean of PA and UA, providing a summary of classification accuracy for individual classes.
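These classwise metrics can be computed directly from a confusion matrix, as in the short sketch below, where rows index the reference classes and columns the predicted ones.

```python
import numpy as np

# Sketch of the classwise metrics from a confusion matrix C, where C[i, j]
# counts pixels of reference class i predicted as class j.
def classwise_metrics(C):
    tp = np.diag(C).astype(float)
    pa = tp / C.sum(axis=1)           # producer's accuracy (recall per class)
    ua = tp / C.sum(axis=0)           # user's accuracy (precision per class)
    f1 = 2 * pa * ua / (pa + ua)      # harmonic mean of PA and UA
    return pa, ua, f1
```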
To also assess the synthesis quality of ResBaGAN, the FID score [36] was adapted to use the full spectral resolution of the datasets. This score is a widely accepted metric for determining the quality of GAN-generated samples in RGB data. However, we have not found a standard procedure for its application in the remote sensing field.
The FID score uses a classifier network to determine the similarity between the data distribution of the synthesized samples and that of real samples. This metric leverages a pretrained Inception v3 architecture to summarize the samples into the visual features that are compared. However, as this RGB network cannot fully use the spectral resolution of remote sensing data for an accurate assessment [54], [55], it was replaced by the improved ResNet from Table I to use the full spectral resolution of the samples.
To calculate the FID score of ResBaGAN on a particular dataset, all available real samples for each class were gathered and compared with an equal number of randomly synthesized samples from G for the corresponding class. The FID score comparison was run on the features extracted by the average pooling layer of the improved ResNet. This network had been pretrained for the corresponding dataset in a standalone manner, to prevent introducing biases from the ResBaGAN architecture.
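A minimal sketch of the adapted FID computation is shown below. It assumes the per-class features have already been extracted by the pretrained ResNet's average-pooling layer, and it follows the standard Fréchet distance between two Gaussians fitted to the real and synthetic feature sets.

```python
import numpy as np
from scipy import linalg

# Sketch of the adapted FID: features come from the improved ResNet's
# global average-pooling layer instead of Inception v3, so all spectral
# bands contribute. `feat_real` and `feat_fake` are (n_samples, n_features)
# arrays of extracted features for one class.
def fid(feat_real, feat_fake):
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)          # matrix square root
    if np.iscomplexobj(covmean):                   # discard numerical noise
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2 * covmean))
```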
Finally, to account for the inherent randomness in the experiments due to factors such as the random parameter initialization in neural architectures, all of the experiments were repeated ten times under identical conditions, reporting metrics summaries such as mean values and 95% confidence intervals.
3) Execution Environment: All of the experiments were conducted on a computer equipped with an Intel Core i7-11700K CPU [56], 128 GB of RAM, and an NVIDIA RTX 3080 Ti GPU with 12 GB of VRAM [57]. The system ran Ubuntu 20.04 [58], with most of the source code developed using Python 3.8.13 [59] and PyTorch 1.12.0 [60] for the neural architectures. CUDA 11.6.2 [61] and cuDNN 8.4.0.27 [62] were leveraged to speed up execution by using single-precision arithmetic. Some auxiliary codes, such as the WP segmentation, were developed in C [63] and C++ [64], compiled with GCC 9.4.0 [65]. The tool GuildAI [66] was also used for automating the hyperparameter optimization step.

4) Hyperparameter Optimization: Several design aspects of ResBaGAN were left open in Section III, such as the activation function and learning rate. To determine these factors, a hyperparameter optimization procedure was conducted to evaluate ResBaGAN's performance under many scenarios and identify the optimal design choices. In particular, the following configurations were examined:

- Activation function: ELU [67], LeakyReLU [68], and PReLU [69] were tested, as they generally outperform traditional options like ReLU [70].
- The latent vector size Z, the dropout probability, the learning rate, and the batch size were also examined.

Once all of these tests were run, the results were filtered based on the best classification accuracy and synthesis quality of the GAN-based augmentation. Specifically, the aim was the highest classification accuracies, as determined by OA, AA, and κ, and the lowest training losses in G. Since D and G were trained in an adversarial manner, the lowest loss for G combined with an error rate of around 50% for D indicates that G synthesized samples so realistic that D had to rely on chance to tell them apart from real ones.
Considering these factors, the design choices that maximized the performance of ResBaGAN were the LeakyReLU activation function, latent vectors of size Z = 128, 5% dropout probability, learning rates of around 0.001, and a batch size of 32 samples.

5) On Training Neural Architectures: The ADAM optimizer [71] was employed to train all of the neural architectures, given its strong performance with deep and complex models. Its parameters were set to β1 = 0.5 and β2 = 0.999, as is usual for GAN-based architectures [72]. All of the parameters in the neural architectures were initialized using Xavier's (or Glorot's) initialization [73]. Regarding the number of training epochs, different options were tested while monitoring the loss of the networks in ResBaGAN throughout the process. The conclusion was that 600 training epochs were sufficient for both D and G to reach peak performance on the datasets. To ensure a fair comparison, 600 training epochs were also provided to the remaining architectures, such as the autoencoder inside ResBaGAN or the standalone CNNs. Validation accuracy was monitored as needed to ensure that all classification approaches included in these experiments, in addition to ResBaGAN, experienced stable training without facing problems such as overfitting. Moreover, when needed, the same hyperparameter values as in ResBaGAN were applied to the other classification methods.
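For reference, the following sketch shows how this initialization and optimizer setup might be expressed in PyTorch, using the hyperparameter values selected above; the helper names are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the training setup described above: Xavier initialization and
# ADAM with beta1 = 0.5, beta2 = 0.999 for every network in ResBaGAN.
def init_weights(module):
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def make_optimizer(net, lr=1e-3):
    net.apply(init_weights)
    return torch.optim.Adam(net.parameters(), lr=lr, betas=(0.5, 0.999))
```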

B. Results

1) Stability of Adversarial Training: To verify sustained adversarial learning, the evolution of various training metrics for both D and G is analyzed in an experiment with ResBaGAN on the Ermidas Creek dataset, as depicted in Fig. 7(a)-(c). Fig. 7(a) shows the evolution of G over all 600 training epochs. Its performance increased until around epoch 200 before stabilizing. The remaining epochs still contributed to the final performance of ResBaGAN, given that D persistently lowered its training loss, as Fig. 7(b) shows. However, the continuous loss reduction in D could also be a symptom of overfitting. In this case, Fig. 7(c) reveals that the generalization consistently improved until approximately epoch 500 before stabilizing for the remaining 100 epochs. Overall, these findings demonstrate that executing 600 training epochs enabled ResBaGAN to optimize its performance for the experimental datasets.
2) Synthesis Performance: The synthesizing capabilities of ResBaGAN are compared with two simpler approaches: an ACGAN and a BAGAN-based architecture, both assisting the shallow CNN described in Table V while using the same G network as in ResBaGAN. This BAGAN is a stripped-down version of ResBaGAN, lacking the residual classifier and the support from traditional data augmentation techniques. ACGAN can be seen as an even simpler version of BAGAN, missing the autoencoder and the improved loss function for training. Fig. 8 displays synthesized samples for the Ermidas Creek dataset using these three GAN-based approaches. It is evident that the G network of ACGAN struggled to model the different classes. This resulted in fake samples that primarily resemble different types of vegetation, which are the majority classes in the dataset, and exhibit limited variation. In contrast, class-balancing enhancements allowed the BAGAN-based approaches to perform proper data augmentation. As illustrated in Fig. 8(b) and (c), they could adequately generate fake samples for all classes, including minority ones. Furthermore, increased visual variations are present between fake samples within the same class, particularly in ResBaGAN.
The synthesis performances can also be numerically assessed using the FID score. Following the approach described in Section IV-A2, the FID scores for experiments on the Oitavén River dataset were measured and are displayed in Fig. 9 and Table VI. Fig. 9 displays the score for each class, while Table VI summarizes the FID into an averaged-FID and a weighted-FID for each GAN. The former metric averages the score among classes, whereas the latter assigns greater importance to classes with more elements.
Compared to the other approaches, the high FID scores of ACGAN reflect its limited ability to synthesize varied samples. In contrast, the BAGAN-based approaches achieved significantly better FID scores, particularly in classes 1 to 6 and 9, which contain fewer samples in the Oitavén River dataset. ResBaGAN's FID scores were either comparable to or better than BAGAN's, as summarized in Table VI. The particularly enhanced averaged-FID for ResBaGAN suggests that it mainly outperformed BAGAN in classes with fewer samples. We attribute this superior synthesis performance to the more complex D network in ResBaGAN compared to the shallow CNN in BAGAN; a more robust D pushes G to improve even further, owing to the adversarial nature of GAN training.
3) Overall Classification Performance: Table VII presents the resulting OA, AA, and κ accuracy metrics for ResBaGAN  and other classification approaches on all experimental datasets. To facilitate visual comparison of the classifiers, graphs for each metric are also provided in Figs. 10-12. These graphs display the mean values and confidence intervals from the table for the top four classifiers.
The additional classification approaches compared include a traditional machine learning approach, two standalone CNNs, and the ACGAN and BAGAN described in Section IV-B2. The texture-based traditional method, proposed in [3] for the classification of the same datasets, served as a baseline for non-deep learning approaches. It employs the FV algorithm [74] for texture extraction and KELM [75] for the final classification. The standalone CNNs included the shallow CNN described in Table V, and the improved ResNet described in Table I, which also served as the classifier for FID score assessment.
First, the results reveal that the shallow CNN failed to achieve better classification accuracy than the traditional approach in most cases. These experiments demonstrate that carefully crafted traditional machine learning approaches can still outperform certain modern techniques. That said, the standalone improved ResNet significantly outperformed the traditional method. This is due to the increased potential of using a very deep convolutional architecture rather than a shallow one, to better utilize the limited available learning data.
Examining the results for ACGAN, its classification accuracies were not always better than those of the shallow CNN, even though ACGAN used it as D. These results are consistent with the analysis in Section IV-B2, as ACGAN struggled to properly synthesize fake samples, leading to confusion for the classifier and ultimately damaging its performance. In contrast, BAGAN's enhancements enabled effective data augmentation. Despite both BAGAN and ACGAN sharing the topology of D, the classification performance of BAGAN was significantly higher than that of the shallow CNN in almost every case. More specifically, the most significant gains were in AA, while OA and κ showed more subtle improvements. This suggests that the GAN-based data augmentation primarily enhanced the accuracy in the minority classes, which should be more challenging for the classifier to learn. Although both the improved ResNet and the BAGAN provided impressive results, ResBaGAN, the proposed classification method in this work, managed to outperform them. ResBaGAN achieved the highest and most consistent values for all the classification metrics in seven out of eight datasets, ranking among the best in the remaining one. Moreover, the confidence intervals reveal the substantial significance of ResBaGAN's enhancements in nearly all cases, as its potential minimum values consistently match or exceed the possible maximum values achieved by the other classification approaches.

4) Classwise Classification Performance: To better understand the contributions of ResBaGAN in dealing with scarce and imbalanced datasets, Table VIII presents the mean PA, UA, and F1 for ResBaGAN and the other approaches developed in this work on all the classes of elements, averaging results across datasets. A graphical depiction of the F1 score is also provided in Fig. 13 for easy comparison.
All class-specific findings align with the observations in Section IV-B3. Among the standalone CNNs, the improved ResNet significantly outperformed the shallow CNN in all cases, which is expected due to its more complex network topology.
Using the F1 score to summarize class-specific accuracies, ACGAN deteriorated the overall performance of the CNN. This occurred due to a slight increase in PA but a considerable decrease in UA, particularly for non-vegetation classes, indicating that ACGAN produced numerous false positives, particularly for minority classes. This observation aligns with the fact that ACGAN's G generated synthetic samples resembling vegetation regardless of the target class, leading to a confused D. Consequently, this network classified many more samples as minority classes, leading to occasional additional hits, but also a significant increase in overconfidence and false positives. In contrast, the enhancements in BAGAN provided an effective GAN-based augmentation for the CNN, significantly increasing all metrics to match the performance of the improved ResNet. BAGAN and the ResNet showed varied performance, with one method slightly outperforming the other at times, as summarized by the F1 score. Interestingly, the improved ResNet tended to improve PA the most, achieving better values than BAGAN in six out of ten classes, while BAGAN was more effective in increasing UA, with better values in six out of ten classes.
Finally, ResBaGAN further improved performance across all classes, achieving the best PA, UA, and F1 scores in almost every case. These enhancements were particularly noticeable in the minority classes. For example, concrete is consistently one of the scarcest classes, and ResBaGAN boosted its F1 score by 22%. The solid performance of ResBaGAN when facing scarce and imbalanced data is further exemplified by the confidence intervals for the F1 scores. Even at its lowest potential, ResBaGAN consistently obtains competitive results compared to the improved ResNet and BAGAN. This behavior is further reiterated in the aggregated confusion matrix for ResBaGAN across all ten experiments for each dataset, depicted in Table IX. The strong diagonal dominance indicates high classification accuracy across all classes.

5) Ablation Study: To quantify the contribution of each component, ResBaGAN was also evaluated with individual components removed or replaced:

- Replacing the residual topology in D with the shallow CNN described in Table V.
- Removing the traditional augmentation techniques applied to the training samples.
- Removing the superpixel segmentation guidance from the sample extraction procedure, to instead extract one patch per pixel in a sliding-window manner. The training used the same number of samples as with superpixels.

The results of the ablation experiments are represented in Fig. 14. It is evident that removing any of ResBaGAN's components significantly impacted its performance, as the potential maximum accuracies are always lower than, or at most comparable to, the minimum expected accuracies of the baseline ResBaGAN. Interestingly, AA was the most affected accuracy metric. Given that it assigns equal weights to minority and majority classes, this observation highlights the particular importance of all ResBaGAN components in developing an effective data augmentation framework to adequately assist the classifier in learning minority classes with limited data.

6) Computational Cost: Finally, the computational cost of the various deep learning methods tested in this work is compared. The cost is presented in Table X as averaged speedup values for the training and classification phases across all experiments, using the shallow CNN as a baseline.
First, the improved ResNet required approximately twice as much computation time as the shallow CNN during both phases, which is expected given its much larger neural topology. ACGAN likewise required twice as much training time as the CNN, since it uses this network as D alongside a companion G network of similar complexity. However, the classification time for ACGAN was only slightly higher than that of the CNN, as G plays no role in this phase; we attribute the minor slowdown to the PyTorch overhead of having both networks on the GPU. BAGAN further increased the training time by requiring the learning of the autoencoder before the GAN networks themselves. However, as the autoencoder plays no role during classification, the inference time remained roughly the same as that of ACGAN.
Finally, ResBaGAN exhibits increased complexity due to combining features from both the improved ResNet and the BAGAN, in addition to the traditional augmentation techniques, resulting in the longest training times, and slightly slower classification compared to the improved ResNet. We attribute this slight difference again to the PyTorch overhead of having multiple networks on the GPU.

V. DISCUSSION
Enhancing the accuracy of remote sensing image classification is crucial for numerous monitoring tasks of the Earth's surface. Although deep learning techniques are known for their excellent classification performance, they usually require large amounts of data to unleash their full potential. This can be especially challenging for remote sensing applications, as datasets often have limited labeled data and class imbalances [4]. Data augmentation techniques, particularly those based on GAN architectures [28], present a promising and powerful solution for generating rich additional learning data [29], as demonstrated in this work. BAGAN [35] is a particularly interesting GAN architecture, as it is specifically designed to address the aforementioned limitations. Despite its potential, to the best of our knowledge, the application of BAGANs in remote sensing has been limited to object detection [40] until now.
This study shows that BAGAN-inspired architectures outperform more common GAN approaches like ACGAN [39] in providing effective deep learning-based data augmentation, particularly when working with limited and imbalanced multispectral datasets. Consequently, BAGAN-inspired architectures are proposed as a new baseline for researchers working with GANs in remote sensing. Another crucial finding of this study is that integrating additional techniques can further improve the quality of the GAN-based data augmentation and the final classification accuracy, as demonstrated by the proposed method ResBaGAN, which improves upon BAGAN.
ResBaGAN's adaptability presents a significant advantage for the future, as it can be easily integrated with classifiers from other works to increase their accuracy. To achieve this, the classifier only needs to be integrated as the D network within ResBaGAN, with adjustments made to the design of the G network to match the capabilities of both modules, as illustrated throughout this work.
The primary limitation of ResBaGAN is its long training time compared to other deep learning approaches. However, it presents a good tradeoff between cost and performance: as shown in the experiments in this work, none of the other classification methods achieved an F1 score of 80% across all classes. It is also worth noting that the classification cost of ResBaGAN mainly depends on the computational cost of the chosen classifier, so it could be reduced by replacing the classifier with one of lower complexity. One additional advantage of ResBaGAN is that applying the superpixel segmentation to the image greatly reduces the computational cost of classification compared to a pixel-focused approach. This feature makes ResBaGAN well-suited for rapidly mapping extensive terrain areas in real applications involving remote sensing classification.
Another significant contribution of this work is the successful adaptation of the FID score metric to assess the quality of GAN-based synthesis while using the full spectral resolution of remote sensing data. This modification is necessary because there are currently no standard models for evaluating the FID on remote sensing data, analogous to the Inception v3 network used for RGB data. This adaptation allows for more accurate comparisons among GAN-based networks and can be easily included in other research works. Nonetheless, it would be valuable to explore the development of standard models for evaluating the FID score on remote sensing data.
Various aspects of ResBaGAN open up future work directions to improve its performance. For instance, the BAGAN-based designs in this work do not feature BAGAN's sophisticated latent vector sampling strategy, which could further ease working with limited data. Another possibility for improvement could involve integrating state-of-the-art deep learning architectures like GCNs [18] and transformers [19] as classifiers in ResBaGAN. Moreover, other traditional augmentation techniques not considered in this analysis, such as those described in [25] or [26], could be integrated into the proposed solution.
Regarding the field of application, this study focused on applying ResBaGAN to mapping images of forest areas. It would also be interesting to evaluate ResBaGAN on new datasets to further explore its strengths and limitations, whether in vegetation terrains or in other types of images. Moreover, to evaluate the robustness of ResBaGAN, it would be valuable to analyze its performance using datasets captured under variable illumination and atmospheric conditions, or including noise produced by the sensor. As ResBaGAN extracts patches centered on superpixels and works with the whole spectral dimensionality of the image, it is expected that it could handle the intraclass spectral variability produced by these varying conditions to some extent.
The solution proposed in this work operates over very high-resolution images. However, ResBaGAN could be adapted to operate over remote sensing images with different spectral and spatial resolutions by adapting its stages. For instance, modifying the patch and superpixel size parameters would be necessary for dealing with datasets with different spatial resolutions. For coarser-grained datasets, such as satellite images, employing spectral unmixing techniques [76] might be necessary for situations where multiple, different elements are captured within the same pixel.

VI. CONCLUSION
This work proposes ResBaGAN, a deep-learning method for the classification of remote sensing images, designed to overcome the prevalent challenges of data scarcity and class imbalances. This is achieved by constructing an advanced data augmentation framework that combines BAGAN-inspired augmentation, sample extraction on top of traditional augmentation techniques and superpixel segmentation, and an improved ResNet-based classifier. This integrated approach maximizes the usage of spectral and spatial information from the available training data to effectively overcome the aforementioned issues. Moreover, the superpixel segmentation also makes ResBaGAN suitable for recognizing large terrain areas.
To the best of the authors' knowledge, this is the first study that applies a BAGAN-based architecture to remote sensing classification. In addition, this work also proposes an adaptation of the FID score metric for evaluating the synthesis quality of GANs in multi- and hyperspectral remote sensing images.
ResBaGAN's performance was evaluated in the context of environmental monitoring by using eight large, very high-resolution multispectral images of riparian forests, with limited learning data and strong class imbalances. A comparison to other state-of-the-art classification methods in the remote sensing field, such as ACGAN, was carried out. The results revealed the effectiveness of ResBaGAN in achieving high classification accuracies on constrained datasets, with improved performance across all classes of elements, particularly the minority ones.
Finally, numerous interesting research directions have been identified for future work. For instance, it would be valuable to assess the performance of ResBaGAN on new datasets featuring diverse spatial and spectral resolutions. Additionally, exploring the integration of more advanced components into ResBaGAN could yield further improvements, such as substituting the traditional augmentation techniques with more sophisticated methods or replacing the residual classifier with cutting-edge architectures like transformers.