Generating Hard Examples for Pixel-wise Classification

Pixel-wise classification in remote sensing identifies entities in large-scale satellite-based images at the pixel level. Few fully annotated large-scale datasets for pixel-wise classification exist due to the challenges of annotating individual pixels. Training data scarcity inevitably ensues from the annotation challenge, leading to overfitting classifiers and degraded classification performance. The lack of annotated pixels also necessarily results in few hard examples of various entities critical for generating a robust classification hyperplane. To overcome the problem of the data scarcity and lack of hard examples in training, we introduce a two-step hard example generation (HEG) approach that first generates hard example candidates and then mines actual hard examples. In the first step, a generator that creates hard example candidates is learned via the adversarial learning framework by fooling a discriminator and a pixel-wise classification model at the same time. In the second step, mining is performed to build a fixed number of hard examples from a large pool of real and artificially generated examples. To evaluate the effectiveness of the proposed HEG approach, we design a 9-layer fully convolutional network suitable for pixel-wise classification. Experiments show that using generated hard examples from the proposed HEG approach improves the pixel-wise classification model's accuracy on red tide detection and hyperspectral image classification tasks.

I. INTRODUCTION
PIXEL-WISE classification is the task of identifying entities at the pixel level in remotely sensed images, such as Earth-observing satellite images from multi- or hyperspectral imaging sensors. Pixel-wise classification has some parallels to image segmentation, but there are several limitations to directly using state-of-the-art image segmentation methods for it. Image segmentation methods treat an image as a composition of multiple instances of a scene or object and delineate boundaries between different instances. Current state-of-the-art image segmentation methods acquire the ability to segment these instances either by using a joint detection and segmentation model [1] or by finetuning a detection model [2]. However, these detection abilities are only useful if the target object or scene provides category-specific contextual or structural information and if each instance covers a relatively large area of the image. Unfortunately, these requirements are not typically met in remotely sensed images; thus, the spectral characteristics embedded in each pixel are used as the viable information for pixel-wise classification.

Wonkook Kim is with the Department of Civil and Environmental Engineering, Pusan National University, Busan, Korea, 46241 (e-mail: wonkook@pusan.ac.kr).
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

[Figure 1: Hard Example Generation (HEG). Labels: real examples; Step I: Generating; hard example candidates; Step II: Mining; hard examples.]
One of the issues with pixel-wise classification for remote sensing images is the lack of fully annotated large-scale remote sensing datasets. Since it is exceptionally challenging to annotate each pixel of the remote sensing image, frequently, many pixels in the image remain unlabeled, leading to performance degradation. Further performance decrease is caused by sparse training data, including few hard examples necessary to generate a robust classification hyperplane.
To address the lack of hard examples, we introduce a hard example generation (HEG) approach suitable for pixel-wise classification (Figure 1). The proposed HEG approach takes two steps: i) generating hard example candidates that are recognized as false positives for other categories while preserving the properties of the original category (generation step) and ii) performing hard example mining to discover hard examples that are incorrectly detected with high loss (mining step).
In the first step, we use a variant of the generative adversarial network (GAN) [3] to train a generator that creates hard example candidates. To prevent the generated examples from losing the specific properties of their corresponding category, we train a network to distinguish the real examples from the artificially generated examples, which serves as the discriminator in the GAN framework. For a generated example to be a hard example for another category while preserving the original category's properties, the generator must deceive both a pixel-wise classification model and the discriminator.
arXiv:1812.05447v3 [cs.CV] 7 Apr 2022
For the second step, we redesigned the online hard example mining (OHEM) [4]
To evaluate the proposed HEG approach, we implemented a 9-layer fully convolutional network (FCN) inspired by [5]. The FCN architecture has proved to be suitable for pixel-wise classification [5]-[8]. We validate our approach on red tide detection using a large-scale remote sensing image dataset obtained from the multi-spectral GOCI (Geostationary Ocean Color Imager) [9] on a geostationary satellite. We chose this practical task because it clearly presents the ground-truthing problems mentioned earlier (footnote 1). Due to such challenging ground-truthing problems inherent in remote sensing, red tide occurrences are labeled only at a limited number of locations. Moreover, there are no labels indicating where no red tide was found that could be used as negative examples in training. Therefore, we end up with only a small number of spectral examples from a fraction of the areas where red tide occurs. In this work, we use the images taken in December, when red tides do not occur due to the low water temperature (footnote 2), as negative examples. Figure 2 shows the GOCI images used for the positive (red tide) and negative (non-red tide) training examples, and the red tide region annotation of the positive image.
From this peculiar GOCI image setting, we found severe issues highlighting the need for the proposed HEG approach. First, the spectral characteristics of the images taken in December are very different from those of the images taken in the summer, when red tides mostly occur, because the marine environment in summer and winter is very different. Therefore, the negative examples from the December images do not generally represent the non-red tide areas in the images collected in the summer. Second, the imbalance between the numbers of positive and negative examples is quite significant. The number of positive examples is on the order of one hundred pixels per image. In comparison, the number of negative examples is about 31M pixels per image, as all the pixels of the GOCI image (5567×5685) taken in winter are used as negative examples. The first problem, the lack of non-red tide examples caused by the property discrepancy between the summer and winter images, is addressed by the first step of HEG (i.e., generation of hard example candidates). The second problem, the data imbalance between the positive and negative examples, is effectively alleviated by the second step of HEG (i.e., hard example mining).
Footnote 1: Since the biological properties of red tide are not clearly visible in the image, we used the information on real-world red tide occurrences reported by NIFS (National Institute of Fisheries Science) (http://www.nifs.go.kr/red/main.red/) of South Korea. NIFS manually examined red tide occurrence only at a limited number of locations along the southern seashore of South Korea.
Footnote 2: In South Korea, summer is in July and August and winter in December. Red tide occurs mainly in summer when the water temperature is high.
We conducted extensive experiments to determine how the proposed HEG addresses the problems that arise in training the pixel-wise classification model on GOCI images. For red tide detection, a one-class classification problem with a significantly unbalanced distribution, we use HEG to generate hard negative examples. To show that the proposed HEG can be easily extended to other tasks with multiple categories, we also applied it to several pixel-wise hyperspectral image classification tasks. The experiments confirm that a pixel-wise classification method trained with the proposed HEG significantly enhances performance for red tide detection and several hyperspectral image classification tasks.

II. RELATED WORKS
Training generator via adversarial learning. Szegedy et al. [10] introduced a method to generate an adversarial image by adding a perturbation that causes the image to be misclassified by a CNN-based recognition approach. These perturbed images become adversarial images to the recognition approach. Goodfellow et al. [3] introduced two models: a generator that captures the data distribution and a discriminator that estimates the probability that an example came from the training data rather than the generator. The generator and the discriminator are trained at the same time, each working against the other. This is called an adversarial learning framework.
Radford et al. [11] devised a CNN-based image generation approach by adopting this adversarial learning framework. Wang et al. [12] used the adversarial learning framework to train a network that creates artificial occlusion and deformation on images. An object detection model is trained against this adversary to improve performance. Hughes et al. [13] introduced a negative generator based on an autoencoder that takes a positive image as input. The generator is optimized to make the output acquire the properties of a real image by adopting an adversarial learning framework. This generator is used to augment the set of negative examples, which are not necessarily hard negatives. Choi et al. [14] use GAN-based data augmentation to reduce the domain gap from fully annotated synthetic data to unsupervised data. Xie et al. [15] also use adversarial learning to augment training examples for image recognition. In the proposed work, we also use adversarial learning to train a hard example generator (HEG). Unlike [13]-[15], our HEG generates hard negative examples, which are harder for our red tide detector to identify as non-red-tide examples than other real negative examples.
Hard example mining. Sung and Poggio [16] first introduced hard negative mining (also known as bootstrapping), which trains the initial model with randomly chosen negatives and adapts the model to hard negatives that consist of false positives of the initial model. Thereafter, hard example mining has been widely used in various applications such as pedestrian detection [17], [18], human pose estimation [19], [20], action recognition [21], [22], event recognition [23], and object detection [24]-[27]. There are alternative ways to find hard examples using heuristics [28] or other hard example selection algorithms [29], [30], which avoid training multiple times. Kellenberger et al. [31] use active learning, which requires human intervention to assign labels to critical examples, to address the problem of the small number of positive samples. Shrivastava et al. [4] introduced online hard example mining (OHEM), which, for every training iteration, carries out hard example mining that chooses examples with high training loss. However, OHEM evaluates all examples at each iteration, which is impractical for an extensive example set like ours. Therefore, we use OHEM in a cascaded fashion: we randomly select a subset of examples and then perform efficient mining on it.
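The cascaded use of OHEM described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cascaded_ohem`, the subset size, the number of mined examples, and the toy loss are all hypothetical.

```python
import random

def cascaded_ohem(examples, loss_fn, subset_size, num_hard, rng):
    """Pick hard examples in two stages: random subsampling, then top-loss mining.

    All parameter names are hypothetical; the text only states that a random
    subset is drawn first and that mining is then performed on that subset.
    """
    # Stage 1: draw a random subset so that not every example is evaluated.
    subset = rng.sample(examples, min(subset_size, len(examples)))
    # Stage 2: OHEM-style selection, keeping the examples with the highest loss.
    ranked = sorted(subset, key=loss_fn, reverse=True)
    return ranked[:num_hard]

def squared_error(example):
    """Toy loss for an example given as (score, label)."""
    score, label = example
    return (score - label) ** 2

# Toy usage with made-up data.
rng = random.Random(0)
pool = [(rng.random(), rng.randint(0, 1)) for _ in range(10_000)]
hard = cascaded_ohem(pool, squared_error, subset_size=1_000, num_hard=64, rng=rng)
```

Only `subset_size` losses are computed per call, rather than one loss per example in the full pool, which is the cost the cascade is meant to avoid.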
CNN used for detecting natural phenomena in marine environment. Since CNNs have provided promising performance in image classification [32]-[34], there have been several attempts to use them in the marine environment. CNNs have been effectively used for detection of coral reefs [35], [36], classification of fish [37]-[39], and detection of oil from shipwrecks [40], [41]. However, the application of deep neural networks to detecting objects of interest in the marine environment has been quite limited, mainly due to difficulties in acquiring large amounts of annotated data, unlike general object detection applications. In this paper, we devise a CNN training strategy coupled with an advanced network architecture tailored to red tide detection while minimizing human labeling efforts.

A. Red Tide Detection
In this section, we describe the proposed CNN-based red tide detection approach. This approach takes the GOCI image as input and evaluates whether each pixel in the image belongs to a red tide area or not. Therefore, red tide detection can be considered as pixel-wise classification. The architecture of the proposed approach is built on a model introduced by [5], which is known to be suitable for pixel-wise classification. We apply a sliding window method to deal with limited GPU memory when processing GOCI images.
Pixel-wise classification. Pixel-wise classification has been widely used for multispectral/hyperspectral image classification, which assigns each pixel vector to a corresponding category by exploiting the spectral characteristics of both the pixel and the neighboring pixels in a local region. Unlike general image segmentation [1], [2], which segments distinctive scene components in an image by primarily leveraging object appearance as well as structural characteristics and attributes of the components (e.g., human anatomy, a car with four wheels), pixel-wise classification predicts each pixel in a region using additional features, such as spectral profiles, with few structural attributes available. Therefore, in the proposed work, red tide detection is treated as a pixel-wise classification problem, primarily because red tide is a microscopic alga with no discernible appearance or structure useful for image segmentation.
For recent CNN-based image segmentation, state-of-the-art approaches use a CNN architecture that sequentially stacks multiple layers of filters that capture neighboring information (e.g., 3×3, 5×5 filters) to leverage information over a large area when predicting each pixel. Furthermore, they adopt multiple downsampling layers, such as pooling or convolution layers with stride ≥ 2, which are known to encode structural characteristics adequately. Our approach, described in the next section, also adopts a CNN architecture. However, it stacks layers made up of 1×1 filters, except for the initial layers, and does not use any downsampling layers. The first layer, consisting of multi-scale filters, does not capture structural characteristics but rather imposes spatial continuity over neighboring pixels such that they have the same identity.
Architecture. The architecture of the proposed red tide detection model is shown in Figure 3. To handle the pixel-wise classification required for red tide detection, we use a 9-layer fully convolutional network (FCN), which takes an image of arbitrary size as input. The network takes image patches of 25×25 as input during training, while at test time it is fed an image patch whose dimensions are determined by the maximum size of GPU memory. The network's initial module is a multi-scale filter bank consisting of convolutional filters of four different sizes (1×1, 5×5, 9×9, and 13×13). The architecture of the multi-scale filter bank is slightly different between training and testing. Given an image patch of 25×25 in training, each k × k filter is convolved with a patch of (2k − 1) × (2k − 1) centered on the 25×25 patch. The size of the smaller patch, i.e., (2k − 1) × (2k − 1), is chosen so that each convolution always includes the center of the larger 25×25 patch. For example, when applying a 5×5 filter to a 9×9 patch, the 5×5 window always contains the center pixel of the 25×25 patch being evaluated. After the initial convolution, max pooling is applied to the outputs of the convolutional filters so that the pooled feature maps have a size of 1×1, except for the 1×1 convolution, whose output is already 1×1. At test time, the four filters are applied to the same patch of the same size. These convolutions use appropriate padding to ensure that the four pooled feature maps have the same size. The four output feature maps are concatenated for both training and testing and then fed to the second convolutional layer. Accordingly, due to the multi-scale filter bank architecture, the receptive field of our network becomes 25×25, and the network uses the spatial information within this receptive field when evaluating each pixel.
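The patch-size rule of the training-time filter bank can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions without padding during training; the helper names are hypothetical and only the sizes (25, and k in {1, 5, 9, 13}) come from the text.

```python
FILTER_SIZES = (1, 5, 9, 13)   # filter sizes of the multi-scale filter bank
PATCH = 25                     # training patch size (25x25)
CENTER = PATCH // 2            # index of the center pixel along one axis

def center_crop_bounds(k):
    """Return (start, stop) indices of the (2k-1)-wide crop centered on the patch."""
    start = CENTER - (k - 1)
    return start, start + (2 * k - 1)

def valid_conv_output(crop, k):
    """Output width of a stride-1 convolution without padding (an assumption)."""
    return crop - k + 1

summary = {}
for k in FILTER_SIZES:
    start, stop = center_crop_bounds(k)
    crop = stop - start
    # Every k-wide window that fits inside the crop must cover the patch center.
    window_starts = range(start, stop - k + 1)
    covers_center = all(s <= CENTER < s + k for s in window_starts)
    summary[k] = {
        "crop": crop,                               # (2k-1)
        "conv_out": valid_conv_output(crop, k),     # k, pooled down to 1 afterwards
        "covers_center": covers_center,
    }
```

Under these assumptions, the convolution output for a k × k filter is itself k × k, so a single max pooling over the whole response yields the 1×1 feature map, and the largest filter (k = 13) consumes the full 25×25 patch.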
In training, dropout layers, which are commonly used to mitigate overfitting, are added at the end of the 7th and 8th layers. The rest of the network is the same in training and testing. Specifically, a binary sigmoid classifier, which supports both single-label and multi-label classification, is used for the output layer so that the same architecture can later be used to identify other natural phenomena (e.g., sea fog, yellow dust) in GOCI images.

B. Training: Adopting Hard Example Generation
To meet the need for hard examples in learning, from a small number of examples, an accurate hyperplane that generalizes to test examples, we introduce the hard example generation (HEG) approach. It takes two steps: i) generation of hard example candidates (generation step) and ii) hard example mining (mining step). This section provides details for each step and our training strategy to jointly train the red tide detection model with HEG. Note that the details given focus primarily on the red tide detection task but can easily be extended to other pixel-wise classification tasks.
Generation step. This step develops a generator that creates artificial examples that the red tide detection model is likely to classify as false positives. The proposed generator creates hard negative example candidates only for single-category red tide detection. However, it can still be applied to other multi-category pixel-wise classification tasks for generating hard example candidates for multi-category positives and negatives. This extension will be presented with hyperspectral image classification in Section V. The generator is designed as a 10-layer conv-deconv network consisting of eight convolutional layers and two deconvolutional layers, inspired by U-Net with its high image generation capability [42]-[46], as shown in Figure 4.
We aim to achieve two goals in the training of the generator. First, the generator must be able to fool the red tide detection model so that the generated examples are incorrectly classified as red tides. Second, generated examples should have typical non-red-tide spectral characteristics. To achieve the goals, we introduce a discriminator that distinguishes real examples from artificially generated ones. The generator is trained to deceive the discriminator as in the typical GAN framework [3]. The discriminator consists of four convolutional layers and one fully-connected layer, as shown in Figure 4. Generated examples become hard example candidates that are designed to maximize the losses of the red tide detection model and the discriminator, which conflicts with the two models' objectives. The training process of the generator and the discriminator is shown in Figure 4.
To mathematically formulate the process of generating hard negative examples, the red tide detection model and its loss are represented by $F_{rtd}$ and $\mathcal{L}_{rtd}$, respectively. The red tide detection model is trained by minimizing its loss, where $E$ and $L_{rtd}$ are the training examples and their associated labels, respectively. For each example $e \in E$, its red tide label $l_{rtd} \in L_{rtd}$ can be either 1 (red tide) or 0 (non-red tide). $H(p, q)$ is the cross-entropy for the distributions $p$ and $q$.
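The loss can be written out explicitly. The following is a sketch consistent with the definitions above, assuming that the loss is averaged over the training examples and that the label is the first argument of $H$ (both are assumptions, as is the binary form of the cross-entropy):

```latex
\mathcal{L}_{rtd}(E, L_{rtd}) = \frac{1}{|E|} \sum_{e \in E} H\big(l_{rtd},\, F_{rtd}(e)\big),
\qquad
H(p, q) = -\,p \log q - (1 - p)\log(1 - q).
```

With $l_{rtd} = 1$ the second term of $H$ vanishes and the loss penalizes low red tide scores; with $l_{rtd} = 0$ it penalizes high ones.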
The discriminator and its loss are denoted as $F_d$ and $\mathcal{L}_d$, respectively, in Equation 2. The discriminator is optimized by minimizing its loss, where $N$ is a set of real negative examples. The discriminator's labels can be either 1 (real example) or 0 (artificially generated example). Accordingly, the generator loss ($\mathcal{L}_g$) uses the label 1, which indicates that the labels associated with the generated negative examples are red tide for $F_{rtd}$ or real negatives for $F_d$. The generator can be trained by minimizing $\mathcal{L}_g(N)$, where the outputs $F_g(N)$ become adversarial examples for the red tide detection model and the discriminator.
3-stage training strategy. For training the red tide detection model using the proposed two-step HEG approach, we adopt a 3-stage training strategy. The first stage trains the initial red tide detection model using cOHEM. In the second stage, the generator and the discriminator are trained, as shown in Figure 4. In this stage, the discriminator is first trained with the generator weights unchanged, and then the generator is trained while keeping the discriminator and the red tide detection model fixed. In the last stage, the red tide detection model is updated using hard examples that are the output of the proposed hard example generation approach. Hard examples are mined from real positives, real negatives, and artificially generated negatives via cOHEM. In the third stage, the generator weights are fixed.
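The label assignments for the discriminator and generator losses described above can be sketched in code. This is an illustration under assumptions: the losses are taken as unweighted sums of binary cross-entropy terms for a single example, and the function names and scores are hypothetical.

```python
import math

def bce(label, prob, eps=1e-12):
    """Binary cross-entropy H(label, prob) for a single scalar prediction."""
    prob = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """Real negatives carry label 1; generated examples carry label 0."""
    return bce(1, d_real) + bce(0, d_fake)

def generator_loss(rtd_fake, d_fake):
    """Generated negatives are labeled 1 for both models the generator must fool:
    'red tide' for the detector and 'real' for the discriminator."""
    return bce(1, rtd_fake) + bce(1, d_fake)

# Hypothetical scores for one generated example.
weak = generator_loss(rtd_fake=0.1, d_fake=0.2)     # fools neither model
strong = generator_loss(rtd_fake=0.9, d_fake=0.8)   # fools both models
```

The generator loss falls as the generated example is scored more like red tide by the detector and more like a real negative by the discriminator, which is exactly the behavior that makes the example a hard negative candidate.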

IV. EXPERIMENTS: RED TIDE DETECTION
A. GOCI Satellite Images
GOCI (Geostationary Ocean Color Imager) acquires multispectral images of a large area surrounding the Korean peninsula. The GOCI image [9] has 8 channels consisting of six visible and two near-infrared (NIR) frequency bands (footnote 3) and a spatial resolution of 500 m. The size of the GOCI image is 5567×5685. Some examples of GOCI images are shown in Figure 2. Several red tide examples on GOCI multi-spectral images are also shown in Figure 7.
In this paper, we use GOCI images taken in July, August, and December of 2013 to evaluate our red tide detection model. Images from July and August where red tide occurred are used as positive images, and images from December are used as negative images. Based on some conditions such as the atmosphere, we chose eight images in July and August and four images in December. Half of them were used for training and the other half for testing.
To label red tide pixels, we used the red tide information reported by NIFS (National Institute of Fisheries Science) of South Korea, which directly tested seawater from a ship. NIFS examined red tide occurrence only at a limited number of locations, so the reports cannot cover the entire red tide area. Furthermore, the red tide positions indicated in the reports were not very accurate due to the error-prone manual process that included mapping geo-coordinates of red tide locations onto GOCI images. Hence, we extended potential red tide regions up to 25 km (a distance of 50 pixels) in all directions from the red tide location indicated in the report and then labeled red tide with experts' help. Approximately 100 pixels from each training image were sparsely labeled as a red tide area. We used pixels labeled as red tide as positive examples and all the pixels of the December images as negative examples.
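The figures quoted for the dataset are mutually consistent, as a quick check shows. The constant names below are hypothetical; the values are taken from the text, with 100 used as the approximate number of labeled pixels per image.

```python
IMAGE_SIZE = (5567, 5685)     # GOCI image size in pixels
RESOLUTION_M = 500            # spatial resolution in meters per pixel
SEARCH_RADIUS_M = 25_000      # 25 km extension around each reported location
POSITIVES_PER_IMAGE = 100     # approximate number of labeled red tide pixels

# Total pixels of one image; every pixel of a December image is a negative.
pixels_per_image = IMAGE_SIZE[0] * IMAGE_SIZE[1]
# The 25 km extension expressed in pixels at 500 m per pixel.
search_radius_px = SEARCH_RADIUS_M // RESOLUTION_M
# Rough number of negative pixels available per labeled positive pixel.
negatives_per_positive = pixels_per_image // POSITIVES_PER_IMAGE
```

The pixel count matches the "about 31M" negatives per image mentioned in the introduction, and the resulting ratio of several hundred thousand negatives per positive is the imbalance that the mining step is meant to ease.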

B. Evaluation Settings
Evaluation metrics. We used two different metrics to evaluate the proposed model: the receiver operating characteristic (ROC) curve and the ROC variation curve. The ROC variation curve describes changes in the detection rate based on varying numbers of (true or false) detections per image (NDPI) instead of the false positive rate. This metric is beneficial when there are numerous unlabeled examples whose identity is unknown. Note that in a GOCI image only a fraction of red tide pixels are labeled and the rest of the image remains unlabeled. For quantitative analysis, we calculate the AUC (the area under the ROC curve) and ndpi@dr=0.25, ndpi@dr=0.5 and ndpi@dr=0.75 indicating the NDPI values when the detection rate reaches 0.25, 0.5, and 0.75, respectively.
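The ROC variation curve and the ndpi@dr values can be computed as sketched below. This is one plausible reading of the metric, not the authors' evaluation code: pooling the pixels of all test images into one ranking and the function names are assumptions.

```python
def roc_variation_curve(scores, labeled_positive, num_images):
    """Return (ndpi, detection_rate) points, one per score threshold.

    scores: detection scores of all pixels pooled over the test images.
    labeled_positive: True where the pixel is labeled as red tide.
    Unlabeled pixels still count as detections, since their identity is unknown.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labeled_positive)
    curve, hits = [], 0
    for rank, idx in enumerate(order, start=1):
        hits += bool(labeled_positive[idx])
        # rank detections so far, spread over num_images images.
        curve.append((rank / num_images, hits / total_pos))
    return curve

def ndpi_at_dr(curve, target_dr):
    """Smallest NDPI at which the detection rate reaches target_dr."""
    for ndpi, dr in curve:
        if dr >= target_dr:
            return ndpi
    return None

# Toy example: 8 pixels from 2 images, 4 of them labeled as red tide.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [True, False, True, False, False, True, False, True]
curve = roc_variation_curve(scores, labels, num_images=2)
```

Because the horizontal axis counts all detections rather than false positives, the curve does not require knowing which unlabeled pixels are truly negative.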
Model training. The proposed models are trained from scratch. When HEG is used, we use the three-stage training strategy and train the model for 1250 iterations in each stage. The base learning rate is 0.01 for the red tide detection model and the generator and 0.0001 for the discriminator. The learning rate drops by a factor of 10 every 500 iterations. When the three-stage training strategy is not used (i.e., artificially generated examples are not used for training), we train the model for 2500 iterations. In that case, the base learning rate is 0.01 and drops by a factor of 10 every 1K iterations.
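The two step schedules can be written as a single function. This is a minimal sketch: the name `step_lr` is hypothetical, and since the text does not say whether the schedule restarts at each stage, `iteration` is treated here as the count within one stage.

```python
def step_lr(base_lr, iteration, step_size, gamma=0.1):
    """Learning rate that is divided by 10 every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)

# With HEG: 1250 iterations per stage, drop every 500 iterations.
heg_stage = [step_lr(0.01, it, 500) for it in (0, 499, 500, 1000, 1249)]
# Without HEG: 2500 iterations in total, drop every 1000 iterations.
plain = [step_lr(0.01, it, 1000) for it in (0, 999, 1000, 2000, 2499)]
```

Both settings end at a learning rate two orders of magnitude below the base value, after two drops each.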
The proposed models are optimized using mini-batch Stochastic Gradient Descent (SGD) with a batch size of 256 examples, a momentum of 0.9, and a weight decay of 0.0005. The red tide detection model's training objective is to minimize the cross-entropy loss between the red tide labels and the final output scores. Each batch consists of examples extracted from one positive image with red tide occurrence and one negative image with no red tide occurrence. The positive-to-negative ratio in each batch is set to 1:3.
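The batch composition implied by these settings can be sketched as follows. The helper names are hypothetical, and plain random sampling stands in for the mining step, which is an assumption made to keep the example short.

```python
import random

BATCH_SIZE = 256
POS_TO_NEG = (1, 3)   # positive-to-negative ratio within a batch

def batch_counts(batch_size=BATCH_SIZE, ratio=POS_TO_NEG):
    """Split a batch into positive and negative counts for the given ratio."""
    pos_share, neg_share = ratio
    num_pos = batch_size * pos_share // (pos_share + neg_share)
    return num_pos, batch_size - num_pos

def sample_batch(positives, negatives, rng):
    """Draw one batch from the pixels of one positive and one negative image."""
    num_pos, num_neg = batch_counts()
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)

# Toy usage: pixel identifiers stand in for 25x25 training patches.
rng = random.Random(0)
pos_pixels = [("pos", i) for i in range(100)]      # ~100 labeled pixels per image
neg_pixels = [("neg", i) for i in range(5_000)]    # small stand-in for ~31M pixels
pos_batch, neg_batch = sample_batch(pos_pixels, neg_pixels, rng)
```

A 1:3 ratio with 256 examples gives 64 positives and 192 negatives per batch, so the roughly 100 labeled pixels of a single positive image are enough to fill the positive share.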
To reduce overfitting in training, data augmentation is carried out. Since a GOCI image is taken from a top view, training examples are augmented by mirroring across the horizontal, vertical, and diagonal axes. This mirroring can be performed in one direction or in multiple directions, which increases the number of examples eightfold.
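The eightfold figure follows from combining the three mirrorings, as the sketch below illustrates. It uses a single-channel 3×3 toy patch for readability (a real patch is 25×25 with 8 channels), and the function names are hypothetical.

```python
def flip_horizontal(patch):
    """Mirror left-right."""
    return [row[::-1] for row in patch]

def flip_vertical(patch):
    """Mirror top-bottom."""
    return patch[::-1]

def flip_diagonal(patch):
    """Mirror across the main diagonal (transpose)."""
    return [list(col) for col in zip(*patch)]

def mirror_variants(patch):
    """All patches obtained by applying each mirroring zero or one time."""
    variants = []
    for use_d in (False, True):
        for use_v in (False, True):
            for use_h in (False, True):
                p = [list(row) for row in patch]   # work on a copy
                if use_d:
                    p = flip_diagonal(p)
                if use_v:
                    p = flip_vertical(p)
                if use_h:
                    p = flip_horizontal(p)
                variants.append(p)
    return variants

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
variants = mirror_variants(patch)
```

The eight results are the eight symmetries of a square (the original, three rotations, and four reflections), so no further combination of these mirrorings produces a new patch.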
When training the red tide detection model, all learnable layers except for the layers of the residual modules (3rd, 4th, 5th, and 6th layers) are initialized from a Gaussian distribution with a mean of zero and a standard deviation of 0.01. The layers of the residual modules are initialized from a Gaussian distribution with a mean of zero and a standard deviation of 0.005. All layers of the generator except for the last layer are initialized from a Gaussian distribution with a mean of zero and a standard deviation of 0.02. The last layer is initialized from a Gaussian distribution with a mean of zero and a standard deviation of 50.

C. Architecture Design
In this section, we use a 2-fold cross validation that splits the training set into two subsets, alternating one for training and another for testing. AUC reported in the tables is the average over two validations.
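The 2-fold protocol can be sketched as follows. This is an illustration with toy data: `fit_and_score` is a hypothetical stand-in for training the detection model on one subset and scoring the other, and AUC is computed by direct pairwise comparison.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered pos/neg pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def two_fold_auc(fold_a, fold_b, fit_and_score):
    """Average AUC over two runs, swapping the training and test roles."""
    aucs = []
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        scores = fit_and_score(train, test)
        aucs.append(auc(scores, [label for _, label in test]))
    return sum(aucs) / 2

def toy_model(train, test):
    """Hypothetical model: score = feature minus the mean training feature."""
    mean = sum(f for f, _ in train) / len(train)
    return [f - mean for f, _ in test]

# Each example is (feature, label); the values are made up.
fold_a = [(0.9, 1), (0.2, 0), (0.7, 1), (0.4, 0)]
fold_b = [(0.6, 1), (0.8, 0), (0.3, 0), (0.5, 1)]
mean_auc = two_fold_auc(fold_a, fold_b, toy_model)
```

The pairwise form of AUC is quadratic in the number of examples, which is fine for a toy check but would be replaced by a rank-based computation on real data.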
Finding the model specification. To find the optimal specification of the red tide detection model, we evaluate the model while varying several model parameters, such as the number of filters, the number of residual modules, and the types of filters used in the multi-scale filter bank. The proposed model specification is determined by evaluating detection accuracy (AUC), training time, and test time for the various parameter settings. The final model specification used in the proposed work is shown in Table I. Table I also indicates that when a larger network is used, by increasing the depth and breadth of the network, overfitting on the GOCI dataset starts to occur. We conduct our experiments on one GPU (NVIDIA Titan XP), which affects the training and test times.
We also optimize the generator by changing the number of filters. As shown in Figure 4, the generator is designed as a conv-deconv network consisting of eight convolutional layers and two deconvolutional layers. In this architecture, the number of filters in the first layer, n, is doubled in the third layer and then halved in the sixth layer and again in the ninth layer. The last layer has eight filters so that its output has the same eight channels as the GOCI image's spectral signal. We evaluate detection accuracy and training time to find the optimal architecture while n is varied among 16, 32, 64, and
Table III, the third strategy gives the best performance in terms of AUC and training time. This observation indicates that increasing spatial diversity of sampling is essential in providing competitive performance. Therefore, even though the third strategy requires a large amount of memory due to the large pixels, it is adopted in our training approach.

D. Experimental Results
Baselines. We implement three baselines: SVM and two CNN-based hyperspectral image classification approaches [5], [50]. In SVM, a 25×25 region centered on the pixel under test is used as the feature representing the center pixel. To assess the benefit of adopting hard example mining, we applied conventional hard negative mining [16] to SVM training. The approach of [5] is the CNN-based method that inspired our model. Another CNN-based baseline, Diverse Region-based CNN (DR-CNN) [50], takes as input a set of six different regions (i.e., global, right, left, top, bottom, and local regions) to encode a semantic context-aware representation. In this experiment, we use a 25×25 image patch as the global region, compatible with the input dimensions of our approach. The 13×25 sub-patches at the top and bottom of the patch are used as the top and bottom regions. Similarly, the 25×13 sub-patches at the left and right of the global patch are the left and right regions, respectively. The 3×3 region at the center of the global patch is used as the local region.
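The six regions used for the DR-CNN baseline can be cropped as sketched below. The sketch reads "13×25" as 13 rows by 25 columns, which is an assumption, and the function name `diverse_regions` is hypothetical.

```python
def diverse_regions(patch):
    """Crop the six regions from a 25x25 patch given as a list of rows."""
    size, half, c = 25, 13, 12      # patch size, sub-patch extent, center index
    assert len(patch) == size and all(len(row) == size for row in patch)
    return {
        "global": patch,                                           # 25 x 25
        "top": patch[:half],                                       # 13 rows x 25 cols
        "bottom": patch[size - half:],                             # 13 rows x 25 cols
        "left": [row[:half] for row in patch],                     # 25 rows x 13 cols
        "right": [row[size - half:] for row in patch],             # 25 rows x 13 cols
        "local": [row[c - 1:c + 2] for row in patch[c - 1:c + 2]], # 3 x 3 center
    }

def shape(region):
    """Return (rows, cols) of a region stored as a list of rows."""
    return len(region), len(region[0])

# Toy patch whose entries encode their own (row, col) position.
patch = [[(r, c) for c in range(25)] for r in range(25)]
regions = diverse_regions(patch)
shapes = {name: shape(region) for name, region in regions.items()}
```

With an extent of 13 on a 25-pixel side, opposite regions overlap in the single row or column that passes through the center pixel, so every region contains the pixel being classified.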
Performance comparison. Table IV shows that our model trained using hard examples via HEG provides the highest accuracy in all four metrics. The proposed HEG was effective in improving the performance of our model and the two CNN-based baselines. It is also observed that adopting a hard example mining approach consistently improves the accuracy of all four methods, as it efficiently eases the significant imbalance between red tide and non-red tide examples. Figure 8 shows the ROC variation curves for our model and the baselines. From Figure 8, we can confirm that our model provides significantly enhanced detection performance compared to the baselines over most of the NDPI range. Figure 9 shows red tide detection results from our approach.

E. Analyzing Hard Negative Candidates
In Figure 10, we observe that the generator and discriminator converge successfully in the 2nd training stage. This shows that generator optimization overcomes the interference of the red tide detection model and the discriminator.
Some examples of generated hard negative candidates, i.e., the outputs of the successfully trained generator, are presented in Figure 11. In Figure 11, the activated pixel regions (the 3rd column of each set) tend to have different colors (i.e., different intensity values for some spectral bands) in certain areas than those covered by real non-red tide examples. Note that the activated regions in Figure 11 do not show the typical characteristics of red tide regions, namely narrow elongated bands with sharp boundaries as shown in Figure 7. Furthermore, the regions activated by artificial non-red tide examples appear pink, while real red tide is generally red.

Table IV: Red Tide Detection Accuracy. For each metric, numbers in bold indicate the best accuracy. Note that the higher the AUC value, the better the performance, and the smaller the value of NDPI, the better the performance.

Figure 13: We use the output of the 8th convolutional layer (after the ReLU layer) as the selected ...

F. Ablation Study
Training with unlabeled examples. The main problem with using GOCI images is the labeling of red tide pixels. Labeling every pixel where red tide appears in a GOCI image is quite challenging in practice. Therefore, there is no guarantee that pixels not labeled as red tide in positive images (i.e., red tide images) are non-red tide pixels. We carried out ablation experiments to validate our claim that pixels not labeled as red tide in positive images should not be used to train the proposed model. When unlabeled pixels are used, we also used all pixels of negative images for training. Table V provides the accuracy with and without the unlabeled pixels.

[Figure 14 caption, beginning not recoverable: ... Figure 4. The only difference is to add the "Label Generation" module, which calculates the adversarial label for each example, at the end of the hyperspectral image classification model.]

Generating negative examples with the discriminator only. To demonstrate the effectiveness of fooling the red tide detector when generating hard negative example candidates, we also test a generator trained by fooling the discriminator only. This training approach reduces to a GAN [3] in that it consists of a discriminator and a generator that learn by competing against each other. Augmenting the training data with GAN-based image generation has increased accuracy in many approaches [13], [52]-[54]. However, Table VI shows that the GAN-based data augmentation approach performs significantly worse than our approach. Moreover, it is even worse than training without generated examples. These observations indicate that deceiving the red tide detector when training the generator is essential for improving accuracy.
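The difference between the two generator variants compared in this ablation can be sketched as a loss with an optional detector term. The code below is only an illustration: Equation 3 is not reproduced in this section, so the additive form, the `weight` parameter, the function names, and the toy probabilities are all assumptions rather than the paper's objective.

```python
import numpy as np

def bce(p, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and a constant target."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def generator_loss(d_prob_real, det_prob_red_tide, fool_detector=True, weight=1.0):
    """Generator loss for a batch of generated non-red tide examples.

    d_prob_real:       discriminator probability that each generated example is real
    det_prob_red_tide: detector probability that each generated example is red tide
    """
    loss = bce(d_prob_real, 1.0)  # fool the discriminator: look real
    if fool_detector:
        # Fool the detector: be scored as red tide (label 1) despite being negative.
        loss += weight * bce(det_prob_red_tide, 1.0)
    return loss

# Toy network outputs for three generated examples (hypothetical values).
d_out = np.array([0.8, 0.7, 0.9])
det_out = np.array([0.1, 0.2, 0.1])
gan_only = generator_loss(d_out, det_out, fool_detector=False)
joint = generator_loss(d_out, det_out, fool_detector=True)
```

With `fool_detector=False` the generator is rewarded only for realism, so nothing pushes its outputs toward the detector's decision boundary; the extra term is what would make the generated negatives hard for the detector.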

V. EXPERIMENTS: HYPERSPECTRAL IMAGE CLASSIFICATION
Our approach can also be easily generalized to hyperspectral image (HSI) classification, which likewise requires pixel-wise classification. For HSI classification, we use three benchmark datasets: Indian Pines, Salinas, and PaviaU. For each HSI dataset, 200 pixels randomly selected from each category are used for training, and all remaining pixels are used for testing.
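The per-category split described above can be sketched as follows. The snippet assumes a flat array of class labels containing only annotated pixels; the function name `split_per_class`, the fixed seed, and the toy class sizes are illustrative assumptions.

```python
import numpy as np

def split_per_class(labels, n_train=200, seed=0):
    """Return (train_idx, test_idx) with n_train random pixels per class for training."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_parts = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)  # positions of all pixels of class c
        if idx.size < n_train:
            raise ValueError(f"class {c} has only {idx.size} labeled pixels")
        train_parts.append(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.concatenate(train_parts))
    # Every labeled pixel not drawn for training is used for testing.
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# Toy label vector with three classes of 500, 300, and 250 pixels.
labels = np.repeat([0, 1, 2], [500, 300, 250])
train_idx, test_idx = split_per_class(labels)
```

A class with fewer than 200 labeled pixels cannot satisfy this protocol, which is why the sketch raises an error in that case instead of sampling with replacement.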
Evaluation setting. For hyperspectral image classification, we carried out experiments on three hyperspectral datasets: Indian Pines, Salinas, and PaviaU. The Indian Pines dataset is an image of 145×145 pixels with 200 spectral reflectance bands covering the spectral range from 0.4 to 2.5 µm at a spatial resolution of 20 m. There are 16 material categories in the Indian Pines dataset, but only the eight materials with relatively large numbers of samples are used for evaluation. The Salinas dataset includes 16 classes with 512×217 pixels, 204 spectral bands, and a high spatial resolution of 3.7 m. The Salinas and Indian Pines datasets have the same frequency characteristics because they are acquired by the same AVIRIS sensor. The PaviaU dataset, acquired by the ROSIS sensor, has 610×340 pixels with nine material categories and 103 spectral bands covering 0.43 [...]

[...] where $l_{hic}$ is the label associated with the example $e$ and $C$ is the set of all categories. Generator optimization is carried out with Equation 3 by replacing $L_{rtd}$ (denoted as 1 in the equation) with $L^{a}_{hic} = \{l^{a}_{hic}\}$.

Architecture. Unlike the red tide detection model, which outputs one-dimensional sigmoid probabilities, the hyperspectral image classification model uses a softmax layer to calculate the final probabilities of multiple categories. We also use the softmax loss to train the model, whereas the red tide detection model is optimized by minimizing the cross-entropy loss.
When training the generator in the 2nd training stage, we add a module ("Label Generation" in Figure 14) to the end of the network. This module finds an adversarial label for each example using Equation 4. The example then serves as a hard example for its adversarial label when training the model.

Results. Table IX shows the classification accuracy for the three datasets. For all three datasets, our 9-layer CNN outperforms the baseline introduced in [5]. Our hard example generation strategy further improves performance by at least 0.44%.
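As a rough illustration of what a "Label Generation" step could compute, the sketch below assigns each example the most probable class other than its true label. Equation 4 is not recoverable in this section, so this selection rule is purely an assumption and may differ from the paper's definition; the function names and toy logits are likewise hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def adversarial_labels(logits, true_labels):
    """Pick, for each example, the most probable class other than its true label."""
    probs = softmax(np.asarray(logits, dtype=float))
    masked = probs.copy()
    # Exclude the true class so the selected label is always a wrong one.
    masked[np.arange(len(true_labels)), true_labels] = -np.inf
    return np.argmax(masked, axis=1)

# Toy classifier outputs for three examples and three categories.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.2, 3.0, 2.5],
                   [0.3, 0.1, 4.0]])
true_labels = np.array([0, 1, 2])
adv = adversarial_labels(logits, true_labels)
```

Under this assumed rule, the generator would be steered toward the category that the classifier already confuses most with the true one.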
Note that the problems we encountered with GOCI images (e.g., extreme sparsity of red tide samples, a significantly imbalanced distribution, and difficulties in accurate ground-truthing) are not normally observed in other hyperspectral images. The enhanced performance on HSI classification verifies that the proposed hard example generation approach is also applicable to other related problem domains, such as HSI analysis.

VI. CONCLUSION
In this paper, we have developed a novel 9-layer fully convolutional network (FCN) suitable for pixel-wise classification. Due to the challenges of annotating Earth-observing remotely sensed images at the pixel level, there are very few fully annotated satellite-based remote sensing datasets. To avoid the performance degradation caused by severely insufficient and imbalanced training data, we introduce a novel approach based on hard example generation (HEG). The proposed HEG approach takes two steps: first generating hard example candidates, and then mining hard examples from real and generated examples. In the first step, the generator that creates hard example candidates is learned via the adversarial learning framework by fooling a discriminator and a pixel-wise classification model at the same time. In the second step, online hard example mining is used in a cascaded fashion to mine hard examples from a large pool of real and artificially generated examples. The proposed FCN jointly trained with the HEG approach provides state-of-the-art accuracy for red tide detection. We also show that the proposed approach can be easily extended to other tasks, such as hyperspectral image classification.

APPENDIX
Our FCN architecture is designed to differ slightly between training and testing to make it suitable for pixel-wise classification. The size of the input also differs across the individual training stages. Furthermore, in the first and third training stages, the sizes of positive and negative examples differ within the same stage, which can make the architecture difficult to follow. Accordingly, to give an accurate understanding of the architecture, we provide the sizes of the intermediate outputs of the model in Table X. The red tide detection model weights are transferred between training and testing and across the different training phases.

Table X: The Size of the Intermediate Output. The numbers in parentheses indicate the width, height, and channels of the intermediate output, respectively. C, D, and FC denote the convolutional layer, the deconvolutional layer, and the fully connected layer, respectively.

ACKNOWLEDGMENT
This research was supported by the "Development of the integrated data processing system for GOCI-II" funded by the Ministry of Ocean and Fisheries, Korea.