Robust Semisupervised Land-Use Classification Using Remote Sensing Data With Weak Labels

This work develops robust semisupervised classifiers to tackle the three most challenging problems in land-use classification using remote sensing data, namely, information imbalance, label noise, and image uncertainty. Limited by technology and cost, collecting clean labels for remote sensing images is difficult and often impractical. The change of environment and time also increases the uncertainty of remote sensing images. To overcome the obstacles incurred by the mixed pixels and weak labels, this work proposes dividing the pixels in remote sensing images into two groups, namely, pixels with accurate labels and those with weak labels, before processing the weakly labeled pixels using a nuclear norm-based cost function. To address the imbalanced data problem in pixels with accurate labels, an improved cross-entropy-based cost function is proposed to weigh the contributions from data of different classes based on their importance by exploiting the term frequency-inverse document frequency (TF-IDF) algorithm. Finally, an artificial class called “unknown” is proposed to cope with the interference caused by weakly labeled data with unrepresentative spatial features. Extensive experiments validate the effectiveness of the proposed semisupervised classifier.


I. INTRODUCTION
In the history of earth observation, land-use information has been considered a key factor in observing human development. Reliable and accurate land-use information is critical for understanding historical land use and planning for future land use. The advent of remote sensing technology has enabled the accurate and dynamic monitoring of land-use changes and global resource distributions in a periodic and timely manner [1], [2]. In particular, high-resolution remote sensing images can provide detailed land-use information, which enables us to perform a thorough study of the changes in land resource distributions. As a result, deep learning technology has been widely adopted in remote sensing. Deep learning-based algorithms are highly efficient in processing large-scale high-resolution remote sensing images to reveal hidden spatial features, which helps improve our understanding of remote sensing images. For instance, the seminal work on fully convolutional networks (FCNs) for semantic segmentation proposed in [3] has inspired tremendous research interest in remote sensing [4]-[9]. However, it has been observed that the generalization capabilities of these deep learning-based algorithms are unsatisfactory [10], which hinders their adoption for the automatic processing of large-scale remote sensing data.
The associate editor coordinating the review of this manuscript and approving it for publication was Khin Wee Lai.
The generalization ability of a classifier is closely tied to the quality of the dataset [11]. However, remote sensing data are different from traditional computer vision (CV) data. The differences include the information imbalance in natural image data, and natural image datasets also contain considerable noise in their labeling systems [12]. The common errors in these datasets are shown in Fig. 1. This research focuses on three characteristics of remote sensing datasets: data imbalance, noise, and uncertainty [13], [14].
FIGURE 1. The labels of E1, E2, and E3 are erroneous; this appears to be the result of an unprofessional drafter. E4 is an error caused by the complexity of the land cover. E5 is not an error, but it represents a feature anomaly caused by cloud shadows. All samples come from the DeepGlobe LandCover CVPR2018 dataset.
Imbalanced dataset information is an essential factor that leads to the degradation of classifier performance. The number of samples may vary dramatically across different land-use classes owing to the uneven spatial distribution of land resources. This sample imbalance restricts the classification accuracy of small classes, which reduces the average classification accuracy of image segmentation. In the field of machine learning, the problem of imbalanced learning has been a topic of great interest [15], and learning the decision boundaries between different classes can be a very difficult task [16], [17].
In addition to information imbalance, label noise is a common problem in remote sensing datasets. The problem of label noise might be pervasive for the following reasons [18]. First, there is a high probability of label errors when the land cover in remote sensing images is highly complex or the information provided to drafters is minimal. Second, the credibility of the labels is significantly reduced when an automatic labeling system or unprofessional drafters are used to cut costs. Third, experts in different fields have different identification standards for the same land, which eventually leads to inconsistent labeling results. Finally, various kinds of noise interfere with remote sensing images during data capture and transfer. When the classification datasets are corrupted, the performance degradation of deep learning models becomes more severe than that of shallow classifiers. Therefore, researchers have developed techniques to combat data noise. Although there are some studies on the robustness of remote sensing classifiers, only a few focus on land-use classification under label noise. However, in the practical application of remote sensing, label noise is an urgent and inevitable problem.
Imbalanced data and noise are explained in the previous paragraphs. Uncertainty refers to the random abnormal features in remote sensing images. Compared with natural images, remote sensing images require higher robustness to uncertainty [19]-[21]. Because sensors collect cloud shadows and other irrelevant information, remote sensing images may contain many invalid features. If there is no effective mechanism to deal with these features, which carry no classification significance, they will be distributed uncontrollably among the land-use classes, which will cause the classifiers to overfit.
43436 VOLUME 10, 2022
This study uses a scheme similar to semisupervised learning and proposes a loss function composed of two components. The first component computes the distance between the label and the corresponding prediction matrix using an improved cross-entropy (ICE) approach. In addition, a new weight representing the importance of sample information has been added into the cross-entropy function. The second component is designed to maximize the rank of the prediction matrix by exploiting the nuclear norm. An increase in the rank of the prediction matrix means a decrease in redundant information. More specifically, the contributions of this research can be summarized as follows.
• An effective solution for remote sensing image classification in the presence of noisy labels is provided. It is very general and can be seamlessly applied to current neural networks.
• To circumvent the imbalanced data problem, the term frequency-inverse document frequency (TF-IDF) [22] is introduced. An algorithm initially developed for document search and information retrieval is utilized to weigh the loss function based on the sample size of each class, and the weight is added into the cross-entropy computation.
• There is an additional component of the nuclear norm in the loss function. The information redundancy in the prediction matrix is reduced by maximizing the nuclear norm. This is similar to minimizing information entropy, but maximizing the nuclear norm can avoid the performance degradation of the classifiers due to information imbalances.
• A new class called the ''unknown'' class is added to the classifier. None of the information in the dataset is about the ''unknown'' class. This class does not have any labels, so it cannot participate in the cross-entropy computation. However, it has significant implications in nuclear norm maximization. In addition, the ''unknown'' class collects anomalous features to prevent overfitting of the classifier.
Extensive computer experiments were performed to show that the resulting semisupervised classifier is highly robust against the mixed pixel, weak label, and imbalanced data problems by exploiting a smaller amount of weakly labeled data. The proposed classifier is particularly attractive because it can make use of weak data, such as historical data of the same area accumulated over the years.
The remainder of this paper is organized as follows. Sec. II introduces the classical techniques for improving classifier performance. Sec. III formulates the problem and states the assumptions. Sec. IV elaborates on the proposed semisupervised classifier, and the extensive simulation results are presented in Sec. V. Finally, the conclusions are presented in Sec. VI.
Notation: Vectors and matrices are denoted by boldface letters. $\|A\|_F$ and $\|A\|_\nu$ denote the Frobenius and nuclear norms of $A$, respectively. Furthermore, $[A]_{i,j}$ denotes the element in the $i$-th row and the $j$-th column of $A$. $\mathrm{rank}(A)$ and $\mathrm{trace}(A)$ represent the rank and trace of $A$, respectively. In addition, $A^T$ and $A^H$ are the transpose and conjugate transpose of $A$, respectively. Finally, sets are represented by calligraphic letters, while $|\mathcal{X}|$ represents the cardinality of the set $\mathcal{X}$.

A. INFORMATION IMBALANCE IN DEEP LEARNING
Imbalanced information is a traditional and common problem, and research on this problem has drawn extensive attention. For this problem, traditional solutions include resampling and reweighting. Chawla et al. [23] proposed a scheme called the synthetic minority oversampling technique to increase the importance of unusual samples. He and Garcia [24] explained how to process unbalanced data and explored the relationship between different resampling methods and classifier performance. Recently, Byrd et al. [25] discussed the relationship between the training samples' position and the classifier's performance. They argue that when the samples are sufficient, the information balance can be better achieved by resampling. These methods help us understand the relationship between samples and classifier performance from the perspective of resampling.
Another representative scheme is reweighting. Khan et al. [26] proposed a cost-sensitive (CoSen) deep neural network, which can automatically learn robust feature representations for both the majority and minority classes. Cui et al. [27] observed that datasets contain information overlap, so they proposed a novel theoretical framework to characterize data overlap, and a class-balanced reweighting term that is inversely proportional to the effective number of samples was added to the loss function. Cao et al. [28] alternatively studied the minimum margin per class and designed a label-distribution-aware loss function that encourages a model to have the optimal trade-off between per-class margins. Tan et al. [29] proposed equalization loss to tackle the problem of rare long-tailed categories by ignoring the gradients for rare categories. In recent years, all of these methods have become popular reweighting methods.
Researchers have not stopped exploring the imbalanced information problem. Kang et al. [30] compared jointly learning a representation and classifier to many straightforward decoupled methods and found that instance-balanced sampling gives more generalizable representations that can achieve state-of-the-art performance after properly rebalancing the classifiers. Zhou et al. [31] proposed a new model consisting of two branches, termed the ''conventional learning branch'' and the ''rebalancing branch,'' to simultaneously address both representation learning and classifier learning. These methods have also received more attention in recent years, although they increase the computational costs.

B. LABEL NOISE IN DATASETS
The high cost of acquiring satellite image labels is a well-known problem in the field of remote sensing. Almost all datasets related to semantic segmentation are faced with label noise. Label noise was first considered by pioneers in CV, and these pioneers have produced many exciting and significant research results. Angluin and Laird [32] asked the following question: how can a learning algorithm cope with incorrect training examples? Since then, label noise has been a focus of researchers. Lawrence and Schölkopf [33] proposed an algorithm for constructing a kernel Fisher discriminant (KFD) from training examples with noisy labels. Natarajan et al. [34] theoretically studied binary classification in the presence of random classification noise and provided two approaches to suitably modify any given surrogate loss function. Liu and Tao [35] presented an importance reweighting framework for classification in the presence of label noise. Theoretical analyses were provided to assure that the learned classifier will converge to the optimal classifier for the noise-free sample. These methods are successful on natural images, but their performance degrades when they are directly applied to remote sensing images, owing to the specificity of remote sensing images. Li et al. addressed the label noise problem of remote sensing data based on the multifeature dictionary learning-based collaborative representation classifier (MDLCRC) [36], and a new RSSC-oriented error-tolerant deep learning (RSSC-ETDL) approach to mitigate the adverse effect of incorrect labels in a remote sensing image scene dataset was proposed [37]. Kang et al. [38] used the newly defined robust normalized softmax loss (RNSL). In the same year, they proposed a new deep metric learning loss function, termed noise-tolerant deep neighborhood embedding (NTDNE), which can accurately capture the semantic relations among remote sensing scenes in a feature space [39]. These results show that the label noise problem has become a focus in the remote sensing field.

C. UNCERTAINTY OF REMOTE SENSING IMAGES
The natural surface of the Earth is not composed of a uniform material. As a result, many pixels in remote sensing images may cover multiple substances with different spectral properties [40], [41]. In addition, each pixel in remote sensing images can exhibit spatial characteristics belonging to one or more classes, which may interfere with land-use classification. One naive solution to the mixed pixel problem is to decompose the multiclass classification problem into multiple independent single-class classification problems while ignoring the cross-class correlation. Unfortunately, mixed pixels usually demonstrate nonlinear mixing of different classes, particularly in high-resolution remote sensing images [42]. As reported by Stubenrauch et al. [43], on average, more than 50% of the Earth's surface is covered by clouds every day. Clouds and ''cloud shadows'' are symbiotic in remote sensing images. Arguably, any classifier faces the challenge of ''clouds'' and ''cloud shadows'' when it is used. In general, some commonly used methods, including band grouping/thresholding methods [44]-[46], traditional image segmentation methods [47]-[49], and deep learning-based segmentation methods [50]-[52], can lower the interference of these factors but cannot eliminate it. Notably, these features are complex and cannot be comprehensively characterized with accurate labeling. Many classifiers are designed without considering the uncertainty of remote sensing images. Therefore, more robust classifiers need to be designed to overcome the uncertainty of remote sensing images.

III. PROBLEM FORMULATION AND ASSUMPTIONS
Given a set of $K$ equal-size remote sensing images with $N_{\text{total}}$ pixels each, the task of remote sensing image semantic segmentation is to develop a classifier to produce a prediction matrix $\hat{A} \in \mathbb{R}^{N_{\text{total}} \times N_C}$ for an input image, where $N_C$ is the total number of classes. Furthermore, each element $[\hat{A}]_{i,j} \geq 0$, for $i = 1, 2, \ldots, N_{\text{total}}$ and $j = 1, 2, \ldots, N_C$, represents the probability of the $i$-th pixel of the input image belonging to the $j$-th class, with
$$\sum_{j=1}^{N_C} [\hat{A}]_{i,j} = 1.$$
The conventional supervised learning approach constructs $\hat{A}$ by training on a large set of data samples with correct labels. This data requirement can be an issue of concern in practice when correctly labeled samples are not available.
This study designs a robust semisupervised classifier by exploiting imbalanced remote sensing datasets with both accurate and weak labels. To facilitate the development of the semisupervised classifier, we propose dividing the pixels of the $k$-th image into two sets, namely, $\mathcal{X}_k^{(c)}$ for those pixels falling within the core area of the cluster with well-defined labels and $\mathcal{X}_k^{(b)}$ for those pixels with weak labels, for $k = 1, 2, \ldots, K$. Pixels in $\mathcal{X}_k^{(b)}$ are primarily in the boundary area and potentially belong to multiple classes. Fig. 2 illustrates a hypothetical example of three land-use classes. The pixels were divided into pixels with well-defined and weak labels. Note that pixels with weak labels are defined along each boundary line between any two classes. Before elaborating on our proposed classifier, we first state the three assumptions necessary for establishing valid semisupervised learning models [53].
• Smoothness assumption: Two geographically close pixels in a high-density region should have a strong spatial correlation and subsequently, similar classification labels of high probability [54].
• Cluster assumption: If two pixels are in the same cluster, they belong to the same class with a high probability. Furthermore, if the spectral characteristics of two pixels are similar, the probability of these two pixels possessing identical classification labels should be high [55]-[57].
• Manifold assumption: Remote sensing data reside roughly in a low-dimensional manifold. In other words, samples are assumed to have similar spatial characteristics in a small local proximity and therefore belong to similar classes [58]-[60].
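As a hedged illustration of the pixel division described in this section, the sketch below marks as weakly labeled every pixel that lies within a band around a boundary between two classes of a label map, and treats the remaining pixels as the core set. The function name `split_core_boundary`, the 4-neighbour boundary test, and the square dilation are assumptions made for illustration only; the text does not specify how the boundary band is computed. The band width `m` mirrors the adjustable width of m pixels mentioned in Sec. IV-D.

```python
import numpy as np

def split_core_boundary(labels: np.ndarray, m: int = 1) -> np.ndarray:
    """Return a boolean mask that is True for boundary (weakly labeled) pixels.

    labels: 2-D integer array of per-pixel class indices.
    m: number of pixels on each side of a class boundary treated as uncertain
       (hypothetical interpretation of the band width).
    """
    h, w = labels.shape
    edge = np.zeros((h, w), dtype=bool)
    # A pixel touches a class boundary if any 4-neighbour has a different label.
    diff_v = labels[1:, :] != labels[:-1, :]
    diff_h = labels[:, 1:] != labels[:, :-1]
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h

    # Grow the band by one pixel per iteration (3x3 neighbourhood dilation).
    band = edge.copy()
    for _ in range(max(m - 1, 0)):
        padded = np.pad(band, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(band)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + h, dx:dx + w]
        band = grown
    return band

# Toy 6x6 label map with two classes split down the middle.
toy = np.zeros((6, 6), dtype=int)
toy[:, 3:] = 1
boundary_mask = split_core_boundary(toy, m=1)  # pixels treated as weakly labeled
core_mask = ~boundary_mask                     # pixels whose labels are trusted
```

In this toy map the two columns adjacent to the class boundary form the weak set; a larger `m` widens the band and therefore discards more labels.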

IV. PROPOSED SEMISUPERVISED CLASSIFIER
In this section, we propose a semisupervised classifier to perform robust land-use classification by effectively exploiting weakly labeled and imbalanced remote sensing data with the inherent mixed pixel problem.

A. TF-IDF-BASED WEIGHTING
We begin with the pixels with accurate labels in $\mathcal{X}_k^{(c)}$ and address the imbalanced data problem. Conventionally, cross entropy is employed as the cost function to measure the discrepancy between the true labels and the predicted values in machine learning-based applications [61], [62]. For a given pair of prediction matrices $\hat{A}$ [remainder of the sentence and the cross-entropy definition missing in the source]
However, imbalanced training data will negatively impact the classification decision boundary, and a strong bias toward the more populated classes will exist. To address this problem, we propose a weighted loss function by exploiting the TF-IDF algorithm originally developed for document search and information retrieval [63]. Fig. 3 illustrates the decision boundary before and after applying the weight adjustment in a hypothetical example. As depicted in Fig. 3, the weighted loss function usually focuses on the important data samples while shrinking the decision boundary toward the center of gravity of each class. Furthermore, the TF-IDF algorithm assigns different weights to the contributions from different classes in its loss function based on the frequency and importance of the classes [64]. More specifically, the weighting coefficient of a word in a set of files (also known as a corpus) in TF-IDF is positively proportional to the frequency of its appearance in one file but inversely proportional to the number of files containing the word in the corpus. Thus, the TF-IDF algorithm generates a larger weighting coefficient for a given word if it appears frequently in one file but rarely in other files. Inspired by the TF-IDF algorithm, we treat each class in a set of remote sensing images as one word in a corpus.
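To make the word/corpus analogy concrete, the following sketch applies textbook TF-IDF to per-image class pixel counts, treating each image as a document and each class as a word. It is only an illustration under stated assumptions: the function name `tfidf_class_weights`, the unsmoothed logarithmic IDF, and the per-image normalization are choices made here and are not claimed to match the weighting coefficients defined in this paper.

```python
import numpy as np

def tfidf_class_weights(counts: np.ndarray) -> np.ndarray:
    """counts[k, j] = number of accurately labeled pixels of class j in image k.

    Returns a (K, C) array of weights; each row sums to 1 unless all of its
    raw TF-IDF scores are zero, in which case the row is left as zeros.
    """
    counts = np.asarray(counts, dtype=float)
    n_images = counts.shape[0]
    # Term frequency: share of image k's labeled pixels that belong to class j.
    pixels_per_image = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, pixels_per_image,
                   out=np.zeros_like(counts), where=pixels_per_image > 0)
    # Inverse document frequency: classes present in few images score higher.
    images_with_class = (counts > 0).sum(axis=0)
    idf = np.log(n_images / np.maximum(images_with_class, 1))
    raw = tf * idf
    # Normalize per image so that the weights of one image sum to 1.
    row_sum = raw.sum(axis=1, keepdims=True)
    return np.divide(raw, row_sum, out=np.zeros_like(raw), where=row_sum > 0)

# Three toy images and three classes; class 2 appears only in image 0.
counts = np.array([[50, 30, 20],
                   [60, 40, 0],
                   [100, 0, 0]])
weights = tfidf_class_weights(counts)
```

In image 0 the class that occurs in no other image receives the largest weight even though it covers the fewest pixels. Note that unsmoothed TF-IDF gives zero weight to a class present in every image, which is one reason a practical weighting would likely need smoothing or a different normalization.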
If pixels corresponding to one class appear more frequently in one image but rarely in other images, then a larger weight is assigned to their contribution to the loss function. We first count the total number of pixels in $\mathcal{X}_k^{(c)}$ that belong to the $j$-th class; that is, [equation missing in the source], where $N$ is the total number of pixels with well-defined labels in the $k$-th image. Now, we define the importance of the samples that belong to the $j$-th class in the $k$-th image as follows: [equation missing in the source]. After normalizing $\omega_{k,j}$, the normalized weighting coefficient for the samples that belong to the $j$-th class in the $k$-th image can be expressed as [equation missing in the source]. Finally, we propose the following ICE approach as the cost function for $\mathcal{X}_k^{(c)}$ using the TF-IDF-based weighting coefficients: [Eq. (6) missing in the source]. In the following, $H_{ICE}$, which is defined in Eq. (6), is referred to as ICE. It is worth noting that the contributions from pixels associated with the less populated classes, such as ''Urban land'', are more heavily weighted in ICE compared to those pixels from the more populated classes, such [text missing in the source]
However, $H_{CE}$ is not a good performance metric for data with weak labels, as its corresponding $A_k^{(b)}$ is prone to errors. Fig. 4(a) illustrates a hypothetical example with three classes of weakly labeled data. Fig. 4(b) shows the decision boundary if the correct data labels are used. In contrast, if the classifier is trained to minimize $H_{CE}$, then the resulting classifier may mistakenly categorize the lower two classes of data samples into one, as shown in Fig. 4(c).
Inspired by the observation that inconsistent labels arise owing to the mixed spectral characteristics of several land-use classes, we propose to maximize the rank of the resulting prediction matrix $\hat{A}_k^{(b)}$. For instance, we consider the following two prediction matrices denoted by [matrices and part of the sentence missing in the source] $\hat{A}_1$ is more advantageous as a prediction matrix.
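Because the example matrices did not survive in this text, the following toy comparison is offered as a stand-in, not as the paper's own example. The two hypothetical prediction matrices cover two pixels and two classes: in the first, each pixel is confidently assigned to a different class; in the second, both pixels are split evenly between the classes.

```python
import numpy as np

# Hypothetical prediction matrices (rows: pixels, columns: classes).
confident = np.array([[1.0, 0.0],
                      [0.0, 1.0]])  # each pixel assigned to a different class
ambiguous = np.array([[0.5, 0.5],
                      [0.5, 0.5]])  # both pixels split evenly between classes

for name, pred in (("confident", confident), ("ambiguous", ambiguous)):
    rank = np.linalg.matrix_rank(pred)
    nuclear = np.linalg.norm(pred, ord="nuc")  # sum of the singular values
    print(f"{name}: rank={rank}, nuclear norm={nuclear:.3f}")
```

The confident matrix has rank 2 and nuclear norm 2, whereas the ambiguous one has rank 1 and nuclear norm 1, which illustrates why a higher rank is associated with predictions that are both more decisive and more diverse across classes.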
Unfortunately, the maximization of $\mathrm{rank}(\hat{A}_k^{(b)})$ is nonconvex. Thus, it is nontrivial to directly maximize the rank of the prediction matrix. To address this problem, we propose to maximize the nuclear norm of $\hat{A}_k^{(b)}$ as follows: [equation missing in the source]. It is worth noting that the nuclear norm is essentially the convex envelope of the matrix rank [65]. Nuclear norm-based optimization has been used for matrix completion and robust principal component analysis (PCA) [66]-[68]. Recall that the nuclear norm of $\hat{A}_k^{(b)}$ is the sum of its singular values, and we can consider $\|\hat{A}_k^{(b)}\|_\nu$ to be the approximation of $\mathrm{rank}(\hat{A}_k^{(b)})$. Thus, the maximization of the nuclear norm of $\hat{A}_k^{(b)}$ can effectively increase the number of predicted classes that can be identified in the remote sensing data, which can be translated into classification performance improvement. Furthermore, it has been shown that [69]
$$\frac{1}{\sqrt{Q_k}} \|\hat{A}_k^{(b)}\|_\nu \leq \|\hat{A}_k^{(b)}\|_F \leq \|\hat{A}_k^{(b)}\|_\nu, \tag{12}$$
where $Q_k$ is defined as follows: [equation missing in the source]. Thus, the maximization of $\|\hat{A}_k^{(b)}\|_\nu$ effectively increases the upper and lower bounds of $\|\hat{A}_k^{(b)}\|_F$, as shown in Eq. (12). We recall that $\|\hat{A}_k^{(b)}\|_F$ is inversely related to the entropy of $\hat{A}_k^{(b)}$, which is given by [equation missing in the source]. Therefore, an increase in $\|\hat{A}_k^{(b)}\|_F$ leads to a reduction in $H_E(\hat{A}_k^{(b)})$, which contributes to the improvement in the classification accuracy.
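The relationships among the nuclear norm, the Frobenius norm, and the prediction entropy can be checked numerically. The sketch below is a minimal illustration under two assumptions that are not taken from the paper: $Q$ is chosen as the smaller of the two matrix dimensions (a choice for which the bound is a standard linear-algebra fact), and the entropy is computed as the mean row-wise Shannon entropy. The matrix sizes and the random logits are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def norms_and_entropy(pred: np.ndarray):
    nuc = np.linalg.norm(pred, ord="nuc")   # sum of singular values
    fro = np.linalg.norm(pred, ord="fro")   # square root of sum of squares
    ent = -np.mean(np.sum(pred * np.log(pred + 1e-12), axis=1))
    return nuc, fro, ent

logits = rng.normal(size=(64, 7))  # 64 pixels, 7 classes (hypothetical sizes)
results = {
    "soft": norms_and_entropy(softmax(logits)),          # uncertain predictions
    "sharp": norms_and_entropy(softmax(10.0 * logits)),  # confident predictions
}

q = min(logits.shape)  # assumed choice of Q: min(number of pixels, classes)
for name, (nuc, fro, ent) in results.items():
    # Lower and upper bounds on the Frobenius norm in terms of the nuclear norm.
    assert nuc / np.sqrt(q) <= fro + 1e-9
    assert fro <= nuc + 1e-9
    print(f"{name}: nuclear={nuc:.3f}, frobenius={fro:.3f}, entropy={ent:.3f}")
```

Sharpening the same logits raises the Frobenius norm and lowers the entropy, consistent with the inverse relationship stated above; both sets of predictions satisfy the two-sided bound.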

C. THE ''UNKNOWN'' CLASS
Conventional classifiers are designed to adjust their decision boundaries to accommodate all pixels regardless of the confidence levels of the data labels. As a result, conventional classifiers suffer from poor generalization capabilities, as they are forced to accommodate data with noisy features. Motivated by this observation, we propose the creation of an additional artificial class called the ''unknown'' class to handle data with weak labels, that is, $\mathcal{X}_k^{(b)}$. Therefore, weakly labeled data with atypical spatial features can be classified into this new class without overfitting the classifier, as shown in Fig. 5. As shown in the later experimental results, the new ''unknown'' class can help expedite the training process by preventing overfitting. With the additional ''unknown'' class, the nuclear norm of $\hat{A}_k^{(b)}$ takes the following form: [Eq. (15) missing in the source], where $\hat{A}_k^{(b)}$ is the prediction matrix for the $N_C + 1$ classes. In the following, we use the nuclear norm of $\hat{A}_k^{(b)}$ defined in Eq. (15) as the loss function for the weakly labeled data in our proposed classifier.

D. SEMISUPERVISED LEARNING
In recent years, semisupervised learning has attracted wide attention from scholars in the field of CV. The core idea of semisupervised learning is to use a small labeled dataset to define features and then use unlabeled data to enhance the classifier's ability to understand features. Many semisupervised learning methods, such as MixMatch [72] and Unsupervised Data Augmentation [73], are built on data augmentation strategies such as RandAugment [70] or AutoAugment [71]. Recently, pseudo-labeling and consistency regularization have attracted widespread attention. FixMatch [74] has achieved state-of-the-art results on four benchmark datasets. The above research guides the application of semisupervised learning in remote sensing. Semisupervised learning is well suited to land-use classification because of the low cost of acquiring remote sensing images. However, the application of semisupervised learning technology in satellite remote sensing land classification is still in the development stage. Existing work on tasks such as image classification [75]-[79] and information extraction [80]-[83] has been evaluated only on relatively simple datasets and has not been applied to real scenes on a large scale. In large-scale scenarios, the three assumptions of semisupervised learning cannot be satisfied if unlabeled samples are added blindly.
We use the logic of semisupervised learning to cope with the noise of remote sensing data. We mark the untrusted regions in the data in advance; the pixels in these regions are not involved in the cross-entropy calculation of the loss function but instead enter an unsupervised computation. Our scheme satisfies the three assumptions of semisupervised learning because the information involved in semisupervised learning comes from the same sample. The pseudocode is shown in Algorithm 1.

Algorithm 1 Land Use Classifier Based on Semisupervised Learning
Require: The dataset images $\mathcal{X}$, divided into two sets, namely, $\mathcal{X}^{(c)}$ for pixels with well-defined labels $A^{(c)}$ and $\mathcal{X}^{(b)}$ for pixels with weak labels.
Require: Learning rate $\epsilon$ and initial parameter $\theta$. The maximum number of categories is $N_C + 1$.
1-3: [...] $\hat{A}_t \leftarrow f(X_t; \theta)$
4: Divide the prediction matrix into two sets: [...]
5-7: [...] $W_t$, via Eq. (6)
8: [...] Number of pixels with weak labels in batch $t$.
9: Compute gradient estimate: $\hat{g} \leftarrow +\nabla_\theta L$
10: [...]
11: Apply update: $\theta \leftarrow \theta - \epsilon \hat{g}$
12: end for

Fig. 6 shows a flowchart of the proposed semisupervised classification framework. More specifically, the proposed classification framework can be divided into three components: inference, training, and data preprocessing. For inference, the backbone of the proposed classifier is taken from well-known semantic segmentation models, such as FCN and DeepLabV3+, and is trained using our proposed cost functions. The supervised and semisupervised learning modules share this backbone in the proposed classifier. For data preprocessing, the proposed classifier defines the areas of large uncertainty around each cluster with a width of $m$ pixels, as inaccurate labeling mainly occurs in the boundary areas of different land-use clusters. Note that the parameter $m$ can be adjusted according to the noise level of the given dataset. The detailed structure of the training process is shown in Fig. 7. For the data with accurate labels in $\mathcal{X}_k^{(c)}$, the following ICE-based cost function is used to evaluate the prediction performance, as shown in Eq. (6): [Eq. (16) missing in the source]. In contrast, for the data with weak labels in $\mathcal{X}_k^{(b)}$, the following nuclear norm-based cost function is utilized: [Eq. (17) missing in the source]. Note that the label information for weakly labeled data is discarded in Eq. (17). Therefore, Eq. (17) represents a cost function for unsupervised learning. In summary, the cost function proposed by combining Eq. (16) and Eq. (17) can be expressed as [equation missing in the source], where $\lambda$ is a parameter designed to adjust the contribution of $L_{NuNorm}$ to the cost function. The proposed cost function is formulated to enhance the generalization capability of the classifier by minimizing the negative impact due to the weakly labeled data while expediting the training process by adding an ''unknown'' class to include pixels of high uncertainty.
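The structure of the combined cost function can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the function name `combined_loss`, the normalization by the number of pixels in each group, and the sign convention (the nuclear norm enters with a negative sign so that minimizing the loss maximizes it) are assumptions, and the per-class weights are passed in rather than computed by the TF-IDF rule.

```python
import numpy as np

def combined_loss(probs, labels, weak_mask, class_weights, lam=1.0):
    """Weighted cross entropy on trusted pixels plus a nuclear-norm term on weak ones.

    probs: (N, C + 1) softmax outputs; the last column is the extra unknown class.
    labels: (N,) integer labels in [0, C); ignored where weak_mask is True.
    weak_mask: (N,) bool array, True for weakly labeled pixels.
    class_weights: (C,) per-class weights for the trusted pixels.
    lam: trade-off parameter (lambda in the text).
    """
    rows = np.flatnonzero(~weak_mask)
    # Supervised term: class-weighted cross entropy over trusted pixels only.
    if rows.size > 0:
        p_true = probs[rows, labels[rows]]
        w = class_weights[labels[rows]]
        l_ice = float(-np.sum(w * np.log(p_true + 1e-12)) / rows.size)
    else:
        l_ice = 0.0
    # Unsupervised term: labels are discarded; a larger nuclear norm lowers the loss.
    n_weak = int(weak_mask.sum())
    if n_weak > 0:
        l_nunorm = float(-np.linalg.norm(probs[weak_mask], ord="nuc") / n_weak)
    else:
        l_nunorm = 0.0
    return l_ice + lam * l_nunorm, l_ice, l_nunorm

# Four toy pixels, two real classes plus the unknown class in the last column.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.1, 0.2, 0.7]])
labels = np.array([0, 1, 0, 1])
weak_mask = np.array([False, False, True, True])
class_weights = np.array([0.5, 0.5])
total, l_ice, l_nunorm = combined_loss(probs, labels, weak_mask, class_weights, lam=0.5)
```

Because the unknown class never appears in `labels`, it contributes nothing to the supervised term and influences training only through the nuclear-norm term, in line with the description of the ''unknown'' class above.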

V. RESULTS AND DISCUSSIONS
In this section, we show the effect of our scheme on the classifier through different experiments. Experiment 1 compares the prediction results of different classifiers on cloud-covered images and demonstrates the improvement in terms of classification robustness. In Experiment 2, we consider a chaotic dataset whose quality is closer to that of datasets in industrial applications. Our experimental results show that our scheme can significantly improve the classifier's prediction accuracy and generalization ability.

A. EXPERIMENTS ON THE AIS DATASET
Experiment 1 is designed to show the improvement in terms of classification robustness. The aerial image segmentation (AIS) dataset was used as the baseline dataset, and a ''Damaged dataset'' was created. Three classifiers are trained in this experiment:
• When ''DeepLabV3plus+cross-entropy'' is used to train on the baseline dataset, it is called the ''Baseline'' classifier.
• When ''DeepLabV3plus+cross-entropy'' is used to train on the damaged dataset, it is called the ''Damaged'' classifier.
• When ''DeepLabV3plus+our loss function'' is used to train on the damaged dataset, it is called ''Our'' classifier.
Implementation details: We use gray patches of size 512 × 512 as inputs. Furthermore, we utilize the Adam optimizer with parameters of $\alpha = 0.0001$, $\beta_1 = 0.9$, and $\beta_2 = 0.99$. The training procedure follows the minibatch strategy, and the batch size is 8. All the networks in the experiments are implemented using the PyTorch platform and trained with an NVIDIA GeForce RTX 3080 Ti GPU.

1) DATASET
The AIS dataset contains labels for buildings and roads in Berlin, Chicago, Paris, Potsdam, and Zurich. This experiment used the Zurich data as the baseline dataset. The Zurich AIS dataset contains 364 samples, and we downsampled them to 512 × 512 pixels. Finally, 14000 reliable samples were adopted in this experiment. The 14000 samples were divided as follows: 10000 samples were used for training and 4000 for validation.
We altered 40% of the training set to create the ''Damaged'' set. There are two ways to add noise to the training set, as shown in Fig. 8. First, we randomly ''damaged'' the labels of a 170 × 170-pixel region to simulate label noise. Second, we randomly broke a 170 × 170-pixel region in the image to simulate cloud cover. Fig. 9 shows the percentage of error samples in the training set. There are 1300 samples with noisy labels, 1300 samples with noisy pixels, and 1400 samples with both kinds of noise. The remaining samples are reliable.
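The two kinds of damage described above can be sketched as follows. The text does not state how the damaged regions are filled, so the function name `damage_patch`, the constant fill values, and the uniformly random patch position are assumptions made purely for illustration; the stand-in arrays replace real images and label maps.

```python
import numpy as np

def damage_patch(array, rng, patch=170, fill=0):
    """Return a copy of `array` with one random patch x patch square overwritten."""
    out = array.copy()
    h, w = out.shape[:2]
    # Upper bound is exclusive, so the patch always fits inside the image.
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    out[top:top + patch, left:left + patch] = fill
    return out, (top, left)

rng = np.random.default_rng(42)
image = np.full((512, 512), 128, dtype=np.uint8)  # stand-in gray input patch
label = np.ones((512, 512), dtype=np.uint8)       # stand-in label map

noisy_label, label_pos = damage_patch(label, rng, fill=0)     # simulated label noise
cloudy_image, image_pos = damage_patch(image, rng, fill=255)  # simulated cloud cover
```

Applying the first call to a sample yields the label-noise case, the second yields the noisy-pixel case, and applying both to the same sample yields the case with both kinds of noise.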

2) RESULTS
Twenty training epochs were conducted for each classifier. Fig. 10 shows that the mean intersection over union (mIoU) of the ''Baseline'' classifier on the validation set is 0.6159 ± 0.0052, but that of the ''Damaged'' classifier is only 0.4328 ± 0.0089. This indicates that the degradation in classifier performance is caused by noise. Our scheme raised the mIoU to 0.5974 ± 0.0093, which shows that it yields a better classifier from the same damaged data. Note that the accurate positioning of the error pixels is the key to the good results. Fig. 11 shows the image classification results of the three classifiers. The classification effect of ''Our'' classifier is similar to that of the ''Baseline'' classifier, while the classification effect of the ''Damaged'' classifier is the worst. These classification effects were obtained on images without noise information. Some interesting classification results are shown in Fig. 12, in which noisy images are input into the three classifiers to evaluate their robustness. The ''Baseline'' classifier and the ''Damaged'' classifier could not correctly process the abnormal features, as they had to classify the abnormal pixels into the building, background, and road categories. In contrast, our scheme provides an ''unknown'' class that can be selected for abnormal pixels. As a result, our scheme demonstrated significantly improved robustness in classification.
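For readers unfamiliar with the metric, the following sketch shows one common way to compute the mIoU between a predicted and a reference label map. The averaging convention used in the experiments is not stated in the text, so skipping classes that are absent from both maps is an assumption, as are the function name `mean_iou` and the toy arrays.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Average the per-class intersection over union of two integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: leave it out of the average
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x3 maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.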

B. EXPERIMENTS ON THE DeepGlobe DATASET
In this section, we validate our proposed semisupervised classifier through extensive simulations using the DeepGlobe Land Cover Classification Challenge dataset. We compare the classification performance of the four classifiers discussed above.
1) Supervised CE: The conventional supervised classifier based on the cross-entropy function proposed in [61], [62]. Note that we use this classifier as the baseline to benchmark our proposed classifiers.
The following experiments were implemented using the TensorFlow deep learning framework and performed on a computer equipped with a GeForce RTX™ 2080 Ti GPU. DeepLabV3+ was adopted as the backbone deep network, while minibatch gradient descent (MBGD) was employed as the optimization method with a batch size of 10 and a learning rate of 0.0001. Finally, 20 training epochs were conducted for each experiment.

1) DATASET
The dataset was originally designed for a multiclass segmentation task to detect cities, agricultural areas, pastures, forests, water sources, barren areas, and unknown areas. Similar to all other remote sensing datasets, DeepGlobe contains a large amount of weakly labeled data. We preprocessed the DeepGlobe data by first downsampling its original images of size 2448 × 2448 to 512 × 512 pixels.
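The downsampling step can be sketched as follows. The interpolation method is not specified in the text, so this sketch assumes nearest-neighbor sampling, which has the convenient property that every output value of a label map is still a valid class index; the images themselves may well have been resized with a smoother filter. The function name is hypothetical.

```python
import numpy as np


def downsample_nearest(array, out_size=512):
    """Nearest-neighbor resize of an (H, W) or (H, W, C) array to out_size x out_size."""
    height, width = array.shape[:2]
    # Map each output row/column back to a source row/column.
    rows = (np.arange(out_size) * height) // out_size
    cols = (np.arange(out_size) * width) // out_size
    return array[rows[:, None], cols[None, :]]
```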

Next, we discuss the selection of 7000 downsampled images to create our training and test datasets. As illustrated in Fig. 13(a), some images suffer from large labeling errors. For instance, even though the pixels within the two boxed areas in Fig. 13(a) have similar attributes, they were divided and classified into two different classes, namely, ''Water'' and ''Rangeland'', in the corresponding labels. Because dealing with large labeling errors is beyond the scope of this work, we excluded such images from our datasets. In contrast, the labeling errors caused by the weak labels in the boxed areas in Fig. 13(b) are negligible, while the annotation in Fig. 13(c) is accurate. We included such images with either weak or accurate labels in our datasets. More specifically, we selected 6500 images of 512 × 512 pixels and used 5000 for training and 1500 for testing. Note that these 6500 images may contain weakly labeled pixels. In addition, we manually chose 500 images with accurate labels to form another test dataset for performance analysis. In the sequel, this 500-image test set is referred to as the accurate-label test set, whereas the 1500-image test set is referred to as the weak-label test set. It should be emphasized that the classes contained in the 5000 training images are highly imbalanced, as shown in Table 1, which lists the percentage of pixels belonging to each land-use class in the selected training dataset. Clearly, the ''Agricultural land'' class substantially outnumbers the other classes.
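To make the imbalance issue concrete, the sketch below computes per-class pixel fractions of the kind reported in Table 1 and derives class weights for a weighted cross-entropy. The weighting rule is an assumption: it uses a simple IDF-like term, log(1 / fraction) + 1, normalized to unit mean, and is not the TF-IDF formulation of the proposed ICE cost function, which is defined elsewhere in the paper. All function names are hypothetical.

```python
import numpy as np


def class_fractions(labels, num_classes):
    """Fraction of pixels per class over a stack of integer label maps."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    return counts / counts.sum()


def idf_style_weights(fractions, eps=1e-12):
    """Hypothetical IDF-like weights: rarer classes receive larger weights."""
    weights = np.log(1.0 / (fractions + eps)) + 1.0
    return weights / weights.mean()  # normalize so the average weight is 1


def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Class-weighted cross-entropy; probs has shape (N, C), labels shape (N,)."""
    picked = probs[np.arange(labels.size), labels]
    return float(np.mean(-weights[labels] * np.log(picked + eps)))
```

With such weights, a misclassified pixel of a rare class contributes more to the loss than a misclassified pixel of a dominant class such as ''Agricultural land''.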

2) RESULTS
We first investigate the classification results using the weak-label test set. Fig. 14 shows that the performance of ''supervised CE'' is not satisfactory, as it failed to distinguish ''Rangeland'', which is colored pink, from ''Agricultural land'', which is colored yellow; this may be caused by the imbalanced data between these two land-use classes, as shown in Table 1. In contrast, the proposed ''supervised ICE'' can significantly improve the classification accuracy of ''Rangeland'' by accounting for the imbalanced data problem. However, ''supervised ICE'' cannot properly classify the boundary regions between two adjacent classes. This shortcoming was overcome by the proposed semisupervised learning method based on the nuclear norm. An inspection of Fig. 14 suggests that ''semisupervised CE+NuN'' can better handle the boundary regions using the ''unknown'' class, which is colored in black. Finally, Fig. 14 reveals that ''semisupervised ICE+NuN'' can further improve the classification accuracy. Because such boundary regions randomly appear in different land classes, their unrepresentative spatial characteristics confuse the learning process and slow down convergence. The proposed semisupervised learning method reduces the interference caused by these pixels by classifying them into one new land class.
Fig. 15 magnifies the bottom area near the water body of the image shown in the first column of Fig. 14. From the magnified image, we can see that the area immediately below the water body should not be classified as ''Agricultural land.'' Interestingly, the proposed ''semisupervised ICE+NuN'' method classified this area as ''Barren land,'' whereas ''supervised CE'' classified it as ''Agricultural land.'' We believe the classification of ''Barren land'' is more accurate, as vastly different spatial features can be observed between this area and its neighboring ''Agricultural land'' even by visual inspection.
Furthermore, we can observe from the results presented in the second column of Fig. 14 that the proposed ''semisupervised ICE+NuN'' method identified substantially more ''Rangeland'' areas than ''supervised CE.'' To validate this observation, we magnified the center part of the image, as shown in Fig. 16. First, we observe from the boxed area labeled ''Reference'' that both ''supervised CE'' and ''semisupervised ICE+NuN'' classified this area as ''Rangeland''. This classification result also agrees well with the label, as shown in Fig. 14. A visual inspection suggests that the boxed area labeled ''(a)'' contains very similar spatial features. However, ''supervised CE'' classified this area as ''Forest land'', while ''semisupervised ICE+NuN'' classified it as ''Rangeland.'' Given the information in the boxed area labeled ''Reference,'' we believe ''Rangeland'' is a more appropriate land-use classification for the boxed area labeled ''(a)''. Similar observations can be made in the other figures.
Table 2 shows the mIoU performance of the four classifiers on the accurate-label test set. The results indicate that classifiers with ICE can more effectively lessen the adverse effects caused by imbalanced data than their counterparts with conventional CE. For instance, the mIoU for ''Barren land'' was improved from 43.13 (''supervised CE'') to 56.74 (''supervised ICE'') by exploiting ICE. Similar observations can be made for ''semisupervised CE+NuN'' and ''semisupervised ICE+NuN.'' Furthermore, the proposed semisupervised learning method helped further improve the recognition accuracy of the supervised learning-based classifiers. In particular, compared to the conventional ''supervised CE'' method, the proposed ''semisupervised ICE+NuN'' demonstrated impressive performance gains on the order of 10% for the three least represented land-use classes, namely, ''Rangeland'', ''Water'', and ''Barren land''.
In addition, the mIoU performance for the three most represented classes is comparable for the four classifiers on the accurate-label test set. Fig. 17 shows the mIoU performance as a function of the iteration number for the four classifiers under consideration. An inspection of Fig. 17 reveals that the two proposed semisupervised classifiers achieved faster convergence, as the nuclear norm can remove the interference caused by the unrepresentative features from the mixed pixels, particularly the ambiguous features arising from the junction of multiple classes. By using the unsupervised learning technique, the proposed classifier spends less time learning invalid or even incorrect features, which shortens the training time without causing overfitting.
Finally, we compared the performance of different convolutional neural network (CNN) models, including U-Net, FCN-8s, DeepLabv3, FPN, and DeepLabV3+. The quantitative results shown in Table 3 indicate that DeepLabV3+ generally exhibits the best performance. In addition, regardless of the CNN model, the proposed ''semisupervised ICE+NuN'' model outperformed the conventional ''supervised CE'' model by 3% to 5%. As shown in the experiments, the proposed ''NuN'' cost function worked well with all of the CNN models tested.
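One reason a nuclear norm-based term can be combined with different backbones is that it acts only on the matrix of predicted class probabilities, not on the network architecture. The sketch below illustrates this with NumPy. It assumes the common surrogate in which the negative nuclear norm (the sum of singular values) of the N × C prediction matrix is minimized to encourage high-rank, confident predictions; the exact form and scaling of the proposed ''NuN'' cost function are defined elsewhere in the paper, and the function names and the division by N are assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax of an (N, C) matrix of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def nuclear_norm_loss(logits):
    """Negative nuclear norm of the (N, C) prediction matrix, scaled by N."""
    probs = softmax(logits)
    singular_values = np.linalg.svd(probs, compute_uv=False)
    return -float(singular_values.sum()) / probs.shape[0]
```

In this sketch, a batch of confident predictions spread over several classes yields a lower loss than a batch of uniform, ambiguous predictions, and no labels are needed to evaluate it.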

C. DISCUSSIONS AND FUTURE WORK
It has been a long-standing problem that remote sensing data suffer from much larger uncertainty than data in other research areas, such as computer vision (CV), which has become a major challenge for researchers applying machine learning techniques to remote sensing data. In this work, we have made an initial attempt to open a new avenue for handling uncertainty by recognizing that most pixels in remote sensing images may exhibit characteristics of multiple land-use classes. Thus, in lieu of forcibly classifying a mixed pixel into one specific land-use class, it is more appropriate to classify the mixed pixels into multiple classes using the proposed unsupervised approach. Furthermore, if the pixels show unrepresentative characteristics, we propose classifying them into an ''unknown'' class to accommodate these indistinguishable pixels. Thus, the proposed semisupervised classifier analyzes the uncertainty associated with each pixel before applying unsupervised learning to pixels with high uncertainty. As a result, our proposed semisupervised learning approach offers better generalization capability, greater robustness, and faster convergence.
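The idea of separating pixels by uncertainty before choosing how to treat them can be illustrated as follows. This is only a sketch under assumptions: it measures uncertainty by the normalized entropy of the predicted class probabilities and uses a fixed threshold, neither of which is taken from the paper, whose own criterion is described in the method sections. The function names and the threshold value are hypothetical.

```python
import numpy as np


def normalized_entropy(probs, eps=1e-12):
    """Per-pixel entropy of (H, W, C) class probabilities, scaled to [0, 1]."""
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    return entropy / np.log(probs.shape[-1])


def split_by_uncertainty(probs, threshold=0.5):
    """Return boolean masks (confident, uncertain); the threshold is hypothetical."""
    uncertain = normalized_entropy(probs) > threshold
    return ~uncertain, uncertain
```

Pixels in the confident mask would then contribute to the supervised cost, while pixels in the uncertain mask would be handled by the unsupervised term.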
As discussed before, remote sensing images with large labeling errors are beyond the scope of this work. In the future, we plan to extend the current work to these images. Furthermore, because our proposed semisupervised learning approach can relax the stringent requirements for accurately labeled data, it may be possible for the proposed approach to further reduce its dependence on accurate labels. Fig. 18 shows some interesting observations derived from our experiments on images with large labeling errors.
More specifically, the pixels in Fig. 18(a) belong to ''Agricultural land,'' ''Rangeland,'' and ''Forest land.'' However, the labels corresponding to ''Rangeland'' and ''Forest land'' were largely mistaken. Similarly, the labels shown in Fig. 18(b) also exhibit large errors because the features of water, trees, and houses were ignored in the labels. These large labeling errors may mislead the training of classifiers, especially for ''supervised learning,'' if they are contained in a training dataset. Interestingly, even though the labels were largely mistaken, the proposed ''semisupervised ICE+NuN'' method was able to accurately identify the ''water'' pixels colored in blue and the details in the top image. Furthermore, for the natural environment shown in the middle image, transitional areas between ''Forest land'' and ''Rangeland'' were correctly recognized. This suggests that the nuclear norm can prevent the proposed classifier from overfitting, particularly when the training dataset has a large domain difference and imbalanced data.
Another interesting observation about the proposed semisupervised classifier is shown in Fig. 19, in which the class of ''Water'' was clearly recognized. Because the features of ''Water'' are vastly different from those of other land-use classes, the proposed classifier could accurately identify ''Water'' even with limited information provided by the dataset. However, Fig. 19 also shows that the proposed classifier was not quite able to distinguish ''Shadow'' from ''Water'', as these two classes demonstrate very similar spatial features that are difficult to differentiate even by visual observation. This issue can be an interesting extension of this study and can be further explored.

VI. CONCLUSION
In this study, we developed a semisupervised classifier using a small set of remote sensing data with accurate labels and remote sensing data with weak labels. A weighted cross entropy-based cost function was proposed to circumvent the imbalanced data problem by utilizing the term frequency-inverse document frequency (TF-IDF) algorithm to weigh the contributions from imbalanced data of different classes. In addition, a nuclear norm-based cost function was developed to maximize the rank of the prediction matrix derived from the weakly labeled data without requiring data labels. Furthermore, an artificial class called ''unknown'' was created to alleviate the interference caused by weakly labeled data with unrepresentative spatial features. Extensive experiments were performed using the DeepGlobe Land Cover Classification Challenge dataset and the AIS dataset. The experimental results confirm the effectiveness of the proposed semisupervised classifier.