Self-Filtered Learning for Semantic Segmentation of Buildings in Remote Sensing Imagery With Noisy Labels

Not all building labels used for training improve the performance of a deep learning model. Some labels are falsely labeled or too ambiguous to represent their ground truths, resulting in poor model performance. For example, building labels in OpenStreetMap (OSM) and Microsoft Building Footprints (MBF) are publicly available training sources with great potential to train deep models, but directly using those labels for training can limit the model's performance, as they are incomplete and inaccurate, that is, they contain noisy labels. This article presents self-filtered learning (SFL), which helps a deep model learn well with noisy labels for building extraction in remote sensing images. SFL iteratively filters out noisy labels during the training process based on the losses of the samples. In a multiround manner, SFL makes the deep model progressively learn from refined samples from which the noisy labels have been removed. Extensive experiments with a simulated noisy map as well as real-world noisy maps, OSM and MBF, showed that SFL can improve the deep model's performance under diverse error types and different noise levels.


I. INTRODUCTION
SEMANTIC segmentation of remotely sensed data is important for a wide range of geographic and environmental studies. Among the various applications, building extraction is one of the most popular, as it supports urban planning [1], disaster management [2], and humanitarian aid [3], [4]. There have been significant advances in and interest in building extraction algorithms, spurred by several contests [5], [6], [7]. However, the accurate and automated extraction of buildings remains a challenging task.
In recent years, numerous deep learning-based algorithms have been proposed for building extraction and have shown promising results compared with conventional machine learning algorithms and unsupervised algorithms, provided a sufficient number of high-quality training data [5], [6], [8]. However, obtaining plenty of high-quality building labels is time-consuming and expensive. Although there have been studies to resolve the shortage of training samples with domain adaptation [9], [10], transfer learning [11], and semisupervised learning [12], [13], [14], there are still limitations in their accuracy, as the representation of buildings varies spatiotemporally [8], [15]. Thus, successful deep learning requires abundant, precise, and repeated annotation, but annotation is costly, necessitating a compromise between performance and cost.
There are several types of open training sources for building labels. They can be classified as follows: maps by experts [7], [16], maps by algorithms [17], [18], and crowd-sourced maps [19]. A map by experts refers to a map hand-digitized by experts for cadastral mapping or research purposes. Such maps are relatively more accurate than the other types but are often limited to a few small areas [7], [16]. The second type is a map generated by algorithms; Microsoft Building Footprints (MBF) is an example of such data. Finally, a crowd-sourced map is a map created collaboratively by crowds. OpenStreetMap (OSM) [19], [20] is the most popular crowd-sourced map. The OSM building label is frequently used as a training source in numerous studies and is widely adopted as a starting point for precise building map production [11], [21], [22], [23].
Despite the presence of publicly available building maps, they are often neither sufficient nor reliable training sources for a deep model, for the following reasons. First, maps by experts cover only a tiny part of the world, so a model trained with such a map hardly performs well when applied to different regions [8], [9], [10], [15]. Second, maps by algorithms often contain errors [18], and their quality has not been thoroughly verified in most cases. Finally, crowd-sourced maps contain various kinds of errors, as they are essentially created by crowds without solid quality control [19]. Another unique feature of the crowd-sourced map is that its completeness varies regionally [24], as the mapped areas and buildings are chosen by random volunteers. In addition, it is difficult to know the vintage of the building footprints, as the imagery used for the annotation is unknown [25], [26].
In short, in publicly available building maps, there is always a tradeoff between the diversity of building representations and the correctness of labels. The map by experts is highly accurate but has less diverse building features, whereas the crowd-sourced map has good diversity but is less accurate. Hence, challenges in utilizing publicly available building maps as a training source can be summarized into two problems: 1) the problem of limited representation and 2) the problem of label noise.
The problem of limited representation has been addressed by many studies through domain adaptation [10], [27], transfer learning [11], and semisupervised learning [12], [13], [14]. Those approaches generally assume two different domains: the region where the training samples are collected, the source domain, and a new region to map, the target domain. The goal is then to bridge the data distribution gap between the two domains or to use the knowledge of the source domain for the target domain [27].
On the other hand, the problem of label noise is generally dealt with under the paradigm of training with noise or, more comprehensively, weakly supervised learning (WSL) [28]. Recently, the label noise problem has received much attention in association with deep learning. This is because a large amount of training data has become more easily obtainable and allows models to solve complicated problems [29], but the training data may contain noisy and harmful samples, leading to weak supervision. Thus, it is not surprising that developing novel learning systems that can exploit weak and noisy labels has become a more popular and practical solution [30].
In particular, building labels, such as crowd-sourced maps (e.g., OSM) or machine-generated maps (e.g., MBF), are widespread, and they could become more powerful training resources if an effective method is developed to utilize those labels. However, in the field of remote sensing, only a few studies have addressed the problem of label noise associated with deep learning.
Based on this observation, this article aims to improve the learning ability of a deep model on noisy building labels. We present a simple but effective learning method for noisy training samples, entitled "self-filtered learning (SFL)." A noisy sample refers to an inaccurate or incomplete sample containing label noise, which impedes the successful learning of the model. SFL iteratively filters out noisy samples during the training process by ranking the losses of the samples and lets the deep model progressively learn from the refined samples over multiple rounds.
The rest of this article is organized as follows. Section II summarizes recent studies on WSL focusing on the field of remote sensing. Section III presents SFL. Section IV illustrates the datasets and experimental methods. Section V discusses the performance of SFL, analyzed with results from different noise scenarios, different architectures, different combinations of parameters, and so forth. Finally, Section VI concludes this article.

II. WSL IN REMOTE SENSING
With the significant development of deep learning technology, achieving an acceptable level of performance has become feasible, provided that sufficient high-quality ground-truth labels are available. However, obtaining high-quality labels, i.e., strongly labeled samples, requires a lot of time and effort. Instead of strongly labeled samples, WSL is a method to learn a model from weakly labeled samples [28]. Weak supervision refers to the case where the model learns from training samples that are limited in quantity or quality, as opposed to strong supervision, whose training samples are assumed to be sufficient to represent the target distribution. Thus, the goal of WSL is to build a predictive model that performs as well as a strongly supervised model even with relatively inexpensive and imperfect samples.
Weak supervision is generally categorized into three types: inexact, inaccurate, and incomplete supervision [28]. Inexact supervision refers to the case where only low-level annotations are provided. In the object detection task, for example, it is desirable to annotate all individual instances as input labels, but inexact supervision usually provides only an image-level annotation instead of instance-level annotations. Inaccurate supervision indicates the situation where the given labels are not always the ground truth. Such a situation occurs when the annotator fails to label accurately. Finally, incomplete supervision relates to situations where only a subset of the training data is labeled. Different types of weak supervision often occur simultaneously in real practice, and careful consideration of the given task and label characteristics is crucial to developing a successful WSL algorithm.
In this article, we review WSL in the field of remote sensing, focusing on object detection and semantic segmentation, which are closely related to the building extraction task. Our review found no sharp boundary between inaccurate supervision and incomplete supervision for semantic labeling of remote sensing images, in that both usually assume supervision with noisy maps, such as crowd-sourced maps. Therefore, we combine inaccurate and incomplete supervision into one and refer to it as noisy supervision. The following sections illustrate the related works for the two branches: 1) inexact supervision and 2) noisy supervision.

A. Inexact Supervision
The most common case of inexact supervision in remote sensing is the object detection task, where only an image-level annotation is provided. A desirable annotation for strong supervision in object detection is generally a bounding box that describes the location and size of a target object, known as an instance-level or object-level annotation. However, weak supervision usually assumes a situation where only the presence of the object is provided, known as an image-level annotation. Although inexact supervision also includes other types of annotation, such as point-level annotations [31], [32] and scribble or sparse annotations [33], [34], most studies assume the image-level annotation case, which is the lowest level of inexact supervision [35], [36].
For weakly supervised object detection (WSOD), many effective algorithms have been developed based on the multiple instance learning paradigm [37]. The general approach for WSOD consists of two stages: the first initializes an object's location through region proposals, and the second optimizes the object detector by iteratively training and localizing. There have been studies [35], [36] to detect targets, such as airplanes and ships, with image-level annotations; however, these algorithms are not appropriate for cadastral mapping, as their output is a bounding box.
For building extraction, weakly supervised semantic segmentation (WSSS) is instead more appropriate, as identifying the pixels that belong to a building is more desirable than depicting its bounding box. The general pipeline of WSSS is similar to the two-stage WSOD in that it first proposes a pseudomask, which is then iteratively refined, but WSSS uses a semantic segmentation model instead of an object detector. Only a few studies [38], [39] have performed building extraction with image-level annotations. Both Chen et al. [38] and Li et al. [39] used a class activation map [40] to generate the pseudomask; the semantic segmentation model was then trained with the generated pseudomask to extract a more refined building mask. Although Chen et al. [38] and Li et al. [39] effectively reduced the annotation cost with the WSSS method, their results were much less accurate than those under strong supervision, and performance was easily degraded if the input did not cover the entire building or covered a dense urban area, as observed in [39].

B. Noisy Supervision
In remote sensing, a more detailed level of supervision than inexact supervision can be easily obtained. A typical example is a crowd-sourced map, such as OSM. Although it contains a lot of label noise, it provides pixel-level annotations. Therefore, noisy supervision (e.g., supervision with OSM labels) has great potential to produce more satisfactory results than inexact supervision for building mapping. Despite this potential, only a few studies on noisy supervision can be found in the remote sensing literature.
A straightforward way to address the issue of label noise is to correct it as a preprocessing step. Yuan and Cheriyadat [41] and Vargas-Muñoz et al. [42] refined the misalignment errors of OSM labels by maximizing their correlation with the RGB image and the CNN output, respectively. Griffiths and Boehm [43] utilized a digital surface model (DSM) to refine the overgeneralized building boundaries of the OSM building labels. Although label-refining methods successfully refine the crowd-sourced map as a preprocessing step, the refining often requires particular conditions, and a large quality gap remained between the refined labels and the ground truth.
Other approaches include using either noise-robust architectures or transferred knowledge. Mnih [21] incorporated noise-robust loss functions into the CNN architecture to reduce the effect of omission and registration errors in OSM. Inspired by [44], Zhang et al. [45] and Li et al. [46] adopted an extra noise adaptation layer to model the correlation between clean and noisy labels. They successfully improved the baseline model's performance under the supervision of noisy OSM labels. However, as the noise transition matrix heavily depends on the assumed noise label distribution, these methods often require either an initialization of the matrix with a pretraining setup or a subset of verified clean labels.
Another line of approach is to either use transferred knowledge or simply use a sheer volume of crowd-sourced data. Maggiori [11] first pretrained a fully convolutional network (FCN) [47] with raw OSM and then fine-tuned the model with a relatively small amount of clean labels. Although [11] significantly reduced labeling cost by applying a common practice of transfer learning, it requires some portion of clean verified labels, similar to the assumption of the semisupervised learning setup [12], [13], [14]. Kaiser et al. [23] also confirmed that pretraining with noisy OSM labels benefits deep learning performance and further found that a large amount of OSM labels can be more beneficial than a small number of clean labels. However, the deep model trained with OSM labels performed significantly worse than one trained with the same amount of clean labels. Moreover, Kaiser et al. [23] found that the model, even when trained with a fairly large number of training samples, cannot produce satisfactory results unless it is fine-tuned to the target area. These results tell us that noisy maps have great potential, but, at the same time, noisy supervision has a performance limitation compared with strong supervision.

C. Summary of Related Works
To sum up, considering the increasing number of crowd-sourced and machine-generated maps, we argue that WSL, particularly noisy supervision, would be an effective and practical method. However, research on noisy supervision in remote sensing is still at a very early stage, and there is a significant performance gap between weak supervision and strong supervision.
Meanwhile, there have been numerous efforts to deal with the problem of label noise in the computer vision community. Regularization techniques [48], [49], loss adjustment [50], [51], and sample selection [52], [53], [54], [55], [56] are examples. Among them, sample selection-based approaches try to extract noise-free knowledge from the dataset by monitoring loss dynamics [54], utilizing an auxiliary network [52], [53], or iteratively refining samples [55], [56], assuming that small-loss samples are likely to be clean. Most previous works were developed for the classification problem, as the definition of clean and noisy labels is very clear there [57], [58]. We assumed that a sample selection-based approach could also be effective for the building extraction task, even though it is a semantic segmentation problem. This is because the distribution and severity of label noise differ significantly from region to region in noisy building maps. Thus, we expected a deep model to be able to identify relatively clean samples if samples were randomly cropped from noisy building maps.
Our work aims to improve the learning ability of the deep model against noisy building maps. Sharing the key assumption of sample selection-based methods that a noisy sample tends to exhibit a high loss, we propose SFL, which enables the model to learn from noisy maps more effectively. SFL iteratively filters out noisy samples during the training process based on the losses of randomly cropped samples, so that the model is progressively trained on more refined samples. As the proposed method is a learning method, it is compatible with prior efforts, namely, label-refining methods and noise-robust architectures. Moreover, it requires neither clean labels nor extra specially designed layers or loss functions. It simply leads the deep model to train on more refined samples based on a heuristic rule, yet it is significantly effective.

III. SELF-FILTERED LEARNING
SFL is a learning method that improves the learning ability of the deep model on noisy building maps by iteratively filtering out noisy samples based on the rank of their losses during the training process. SFL exploits the characteristic of noisy building maps that the severity of noise differs significantly from place to place; thereby, it can effectively leverage the sample selection approach even though building extraction is a semantic segmentation task.

The objective of typical supervised learning is to find the best mapping function $f_{\theta}$, parameterized by $\theta$, with the given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. The learning of a deep model can then be seen as finding the optimal parameter $\hat{\theta}$ that minimizes the summation of the losses of all samples

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \ell\left(f_{\theta}(x_i), y_i\right) \qquad (1)$$

where $\ell$ is the per-sample loss function. However, if the given labels contain noise $\tilde{y}_i$, the dataset turns into $\tilde{\mathcal{D}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$. The learning then leads to a suboptimal parameter $\hat{\theta}_{\mathrm{noise}}$, which is not equal to the optimal parameter $\hat{\theta}$

$$\hat{\theta}_{\mathrm{noise}} = \arg\min_{\theta} \sum_{i=1}^{N} \ell\left(f_{\theta}(x_i), \tilde{y}_i\right) \qquad (2)$$

Thus, SFL aims to find a $\hat{\theta}_{\mathrm{SFL}}$ that approximates the optimal parameter $\hat{\theta}$ by filtering the inputs used for the total loss calculation. SFL changes the objective function as follows:

$$\hat{\theta}_{\mathrm{SFL}} = \arg\min_{\theta} \sum_{i=1}^{N} B(x_i, \tilde{y}_i)\, \ell\left(f_{\theta}(x_i), \tilde{y}_i\right) \qquad (3)$$

where $B(x_i, \tilde{y}_i)$ is a binary operator that returns a value of 0 or 1 to decide whether to include the given $(x_i, \tilde{y}_i)$ in the total loss calculation. If the binary operator returns 0, the sample is filtered out and its loss does not affect the gradient descent update; otherwise, the sample is used for the parameter update. The full procedure is summarized in Algorithm 1 (self-filtered learning), whose step 4 removes the training samples whose losses rank within the top filtering rate.
To be specific, SFL determines the binary operator at every epoch E, considering the loss of each sample, the convergence rate, and three hyperparameters (i.e., CRT, MFR, and SL).
The three hyperparameters are introduced to decide 1) when to filter samples, 2) how many samples to filter, and 3) when to stop filtering, respectively. First, CRT determines when to filter samples. The rationale for CRT lies in the assumption that the model is not sufficiently trained when the validation loss changes rapidly. In the very early stage of training especially, the model is unlikely to be trained enough to distinguish noisy samples. Also, when the validation loss fluctuates significantly, it is not guaranteed that every updated model has better distinguishability. Therefore, to prevent the model from mistakenly removing clean samples due to its immature or unstable state, SFL calculates the convergence rate of the model as the decreasing rate of the minimum total validation loss. The total validation loss refers to the sum of the losses of all remaining validation samples at each epoch.
A low convergence rate means that the model is at a loss plateau in gradient descent training, indicating that the model has converged sufficiently or has stagnated. Therefore, SFL filters samples whenever 1) the total validation loss decreases and 2) the convergence rate is lower than a certain threshold (CRT). These conditions allow the model to filter samples only when it has certainly improved its performance but is at a loss plateau.
The next consideration is how many samples should be filtered out at each epoch. When the convergence rate is very low, it is better to remove more samples to boost filtering as well as learning. However, if SFL removes too many samples at once, it can remove clean samples as well. To adjust the number of samples removed at each epoch, the second parameter, MFR, is introduced. Specifically, SFL linearly adjusts its filtering rate from zero to MFR: the filtering rate is zero when the model's convergence rate equals CRT, whereas it is at its maximum (MFR) when the convergence rate is zero. It is worth mentioning again that SFL does not remove any sample when the total validation loss increases after an epoch (see step 5 of Algorithm 1). In addition, inspired by cyclical learning rates [59], SFL reinitializes its learning rate whenever the model obtains the lowest total validation loss so that the model can traverse loss plateaus more rapidly. This also helps prevent the model from fitting to noisy samples at an early stage, the so-called memorization effect [60]. As a result, SFL adaptively regulates the filtering rule considering the dynamics of the learning process and the predetermined CRT and MFR.
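As a minimal sketch of this adaptive rule (the function names and the exact linear form are our illustration of the schedule described above):

```python
def convergence_rate(prev_min_val_loss, new_min_val_loss):
    # Decreasing rate of the minimum total validation loss between two
    # successive improvements; a small value indicates a loss plateau.
    return (prev_min_val_loss - new_min_val_loss) / prev_min_val_loss

def filtering_rate(conv_rate, crt=0.2, mfr=0.1):
    # No filtering while the model still improves rapidly (conv_rate >= CRT);
    # maximum filtering (MFR) at a full plateau (conv_rate == 0).
    if conv_rate >= crt:
        return 0.0
    return mfr * (1.0 - conv_rate / crt)
```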
Finally, the third parameter, SL, was introduced to limit the maximum number of total samples removed from the validation set. Thus, SL prevents the model from removing an unnecessarily large number of samples. If the proportion of removed samples reaches SL, the binary operator B in (3) stops changing, and SFL continues training until it meets the stopping criteria.
Therefore, the objective of SFL at epoch $E$ is to find $\hat{\theta}_{\mathrm{SFL}}^{E}$, which can be formulated as follows:

$$\hat{\theta}_{\mathrm{SFL}}^{E} = \arg\min_{\theta} \sum_{i=1}^{N} B_{E}(x_i, \tilde{y}_i)\, \ell\left(f_{\theta}(x_i), \tilde{y}_i\right) \qquad (4)$$

where $B_{E}$ is the binary operator for epoch $E$.
To sum up, SFL manipulates the input stream at every epoch to lead the model to a $\hat{\theta}_{\mathrm{SFL}}$ that approximates the optimal parameter $\hat{\theta}$, considering the sample losses, the convergence rate, and the predetermined hyperparameters (i.e., CRT, MFR, and SL). Since the rank of the loss is used for filtering, any loss function that measures the difference from the given label can be used. In this article, we used the binary cross-entropy loss.
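A compact sketch of one SFL filtering round under these rules follows. `per_sample_bce` and `reset_learning_rate` are hypothetical helpers, and the sketch filters a single sample pool for brevity, whereas the article filters both training and validation samples, with SL capping the validation set.

```python
import numpy as np

def sfl_filter_step(model, samples, keep_mask, state, crt=0.2, mfr=0.1, sl=0.3):
    # One filtering decision at the end of an epoch. keep_mask is the boolean
    # form of the binary operator B; state caches the best total loss so far.
    losses = np.array([per_sample_bce(model, x, y) for x, y in samples])
    total = losses[keep_mask].sum()

    if "best_total" not in state:             # first epoch: just record the loss
        state["best_total"] = total
        return keep_mask
    if total >= state["best_total"]:          # no new minimum: do not filter
        return keep_mask

    conv = (state["best_total"] - total) / state["best_total"]
    state["best_total"] = total
    reset_learning_rate(model)                # cyclical restart at each new best

    if conv >= crt:                           # still improving rapidly: wait
        return keep_mask
    rate = mfr * (1.0 - conv / crt)           # linear schedule from 0 to MFR
    budget = int(sl * len(samples)) - int((~keep_mask).sum())      # SL cap
    n_remove = max(0, min(int(rate * keep_mask.sum()), budget))

    if n_remove > 0:                          # drop the highest-loss samples
        kept = np.flatnonzero(keep_mask)
        keep_mask[kept[np.argsort(losses[kept])[-n_remove:]]] = False
    return keep_mask
```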
The default values of the hyperparameters were determined heuristically based on observations of the experimental results and the loss dynamics of the deep model. CRT was set to 0.2 by default, which means that SFL filters samples whenever the newly obtained lowest total validation loss and the previous lowest total validation loss differ by less than 20%. MFR was set to 0.1 by default. Finally, SL was set to 0.3 by default. Although the optimal values of the hyperparameters differ across datasets, we found that the performance is not sensitive to the parameter values as long as they are within reasonable ranges. We also found that filtering samples more aggressively can further improve the performance of SFL when training samples are sufficient. A detailed discussion of parameter selection is provided in Section V.

IV. DATASET AND EXPERIMENTAL SCHEME
We experimented with three different types of noisy maps, namely a simulated map, OSM, and MBF. Each map has a corresponding aerial orthophotograph and a hand-digitized clean ground truth. In each case, both training and validation sets were collected from the noisy map to train the deep model, and the model trained with the noisy map was evaluated against a clean, separate ground truth map. A detailed description of each dataset and the experimental scheme is provided in the following sections.

A. Simulated Map
The simulated map was created by adding noise to the fully hand-digitized WHU aerial dataset described in [16]. The original WHU dataset consists of RGB aerial orthophotographs with a spatial resolution of 0.3 m. Four types of errors (i.e., omission, commission, boundary, and both omission and commission) were simulated, as shown in Fig. 2(a)-(d). The original building footprint tile is rotated by a random angle between −15° and 15°. Note that, as the noisy samples were intentionally created by the simulation, each sample can be tagged as noisy or clean, and the tag allows us to check at the evaluation stage whether a sample filtered out by SFL is truly noisy.
With the error simulation, diverse conditions of noisy maps were generated to evaluate the performance of SFL. First, we created four noisy maps, each containing only one of the four error types, to determine whether SFL performs well for all error types. For each noisy map, 20% of the total training samples contain errors; in other words, errors were added to 20% of randomly selected samples. Second, we created maps of different noise levels such that samples including simulated errors account for 0%, 10%, 20%, 30%, 40%, 50%, 60%, and 70% of the total training samples. The 0% case assumes that the open data contain no error, which can also be referred to as strong supervision. For all noise levels except 0%, the four types of errors were evenly configured to have the same ratio. For example, if a total of 1000 training samples are used and the noise sample ratio is 40%, 400 out of 1000 samples contain errors, and each of the four types of errors in Fig. 2(a)-(d) accounts for 100 samples equally. It should be noted that a sample refers to the input unit of the deep model, and the noise sample ratio refers to the proportion of samples that contain any errors out of the total training samples. As a result, we created various types of noisy maps with different noise levels to simulate the common error patterns in building labels. With these simulated maps, we evaluated the performance of SFL under diverse model training conditions.
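As an illustration of this sampling scheme (names are ours, and the error-injection step itself, e.g., deleting or rotating footprints, is dataset-specific):

```python
import numpy as np

ERROR_TYPES = ["omission", "commission", "boundary", "omission+commission"]

def tag_noisy_samples(n_samples, noise_ratio, seed=0):
    # e.g., n_samples=1000, noise_ratio=0.4 -> 400 noisy samples,
    # 100 per error type; untagged samples remain clean.
    rng = np.random.default_rng(seed)
    noisy_idx = rng.choice(n_samples, int(n_samples * noise_ratio), replace=False)
    tags = {}
    for error_type, chunk in zip(ERROR_TYPES, np.array_split(noisy_idx, 4)):
        tags.update({int(i): error_type for i in chunk})
    return tags  # sample index -> error type, later used to verify the filtering
```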

B. Real-World Noisy Map
The OSM and MBF of the Purdue University area and the Indianapolis center were used for the real-world noisy map experiment. Both areas have corresponding orthophotographs for training and hand-digitized ground truth maps for evaluation. The orthophotograph consists of RGB-NIR bands with a spatial resolution of 1 ft, and it is provided by the Indiana Geographic Information Office. All four RGB-NIR bands were used in the experiments. Both OSM and MBF are noisy maps, including many omission, commission, and boundary errors, but the characteristics of the two datasets are quite different.
Since OSM has been produced by various volunteers around the world, it has the advantage of covering extensive areas with various building traits [19], [20]. Several studies adopt OSM as reference data, as a starting point to produce a more accurate map, or as a training source for machine learning models [11], [21], [22], [23]. However, its spatial coverage is regionally heterogeneous, and the quality and vintage of its building footprints vary even in adjacent areas, as the precision of each volunteer and the reference image used for mapping differ. These characteristics are also well represented in the OSM of the area used for the experiment.
On the other hand, MBF is a building footprint map generated by a combination of supervised and unsupervised methods. The supervised method is based on EfficientNet [61] trained with millions of labeled satellite images, and the unsupervised method is based on a polygonization algorithm that refines the building boundaries in the output of EfficientNet. MBF covers several countries (i.e., the United States, Canada, and several African countries). The completeness and quality of the building footprints are relatively homogeneous compared with OSM. However, similar to OSM, it contains errors [18], and the exact vintage of each building instance is often unknown, so there are many outdated labels. In particular, it contains many boundary errors, as it is produced using satellite images, which have a coarser resolution than the orthophotograph used in our experiment.
All real-world noisy maps used in this article are shown in Fig. 3. All include various types of errors, such as omission, commission, and boundary errors, but the characteristics of each dataset are quite different. For example, many building instances are not annotated in the OSM of the Purdue University area compared with both the MBF of the Purdue University area and the OSM of the Indianapolis center. This is a typical feature of OSM: completeness in rural areas is generally poorer than in metropolitan areas, as volunteers tend to map their local areas [25]. In contrast, since MBF is generated by an algorithm, its completeness does not differ significantly across regions.

C. Experiment and Evaluation Methods
To generate the training, validation, and test datasets, 400 nonoverlapping tiles of 512 by 512 pixels of an image and the corresponding building footprint map were cropped from each of the datasets (i.e., simulated map, OSM, and MBF) and divided in a ratio of 60:30:10. Although a larger number of training samples would benefit deep learning, we cropped only 400 tiles (<10 km²) to set up a realistic, data-hungry scenario. For the training and validation sets, the noisy map of each dataset was used, whereas, for the test set, the ground truth of each dataset was used.
U-Net was adopted as the baseline deep model, as it generally produced the highest performance in building detection challenges [5], [6] and has been widely adopted as a baseline model for building extraction studies [16], [62], [63], [64]. In our experiment, we compared three different learning methods to evaluate the performance of SFL. The first method is U-Net with SFL ("SFL-U-Net"). The second method is also U-Net, but trained only with the training samples retained by SFL's filtering ("Filtered-U-Net"). Note that the second method differs from U-Net with SFL in that it does not train and filter simultaneously but simply trains from scratch on the filtered samples, which are a subset of all available samples. The third method is U-Net without SFL ("U-Net (baseline)"). In other words, the binary operator B in (3) changes per epoch for SFL, whereas, for the second method, B is a predetermined static function that does not change through the whole learning procedure, and, for the third method, B is a constant 1 for every epoch. Thus, all three methods use U-Net as the baseline architecture and start training from scratch but adopt different learning methods.
All U-Nets in the three methods take an input of size 256 by 256 by N, where N is the number of bands: three for the simulated map and four for the real-world noisy maps in our experiment. Inputs were randomly cropped from the 400 tiles for minibatch gradient descent training. The Adam optimizer [65] with an initial learning rate of 1e-3 and binary cross-entropy loss were used. The minibatch size was set to 16, and early stopping with a patience of 10 was used. These hyperparameter values were set as defaults for both the simulated and real-world noisy maps unless otherwise noted. The hyperparameters unique to SFL were set to their defaults, as described in Section III. The Intersection over Union (IoU) was adopted as the evaluation metric. To increase the statistical confidence of the results, a total of ten replicate experiments were conducted, and for each replicate, the three methods share the same training, validation, and test datasets. The network is implemented in TensorFlow on a computer cluster equipped with multiple NVIDIA Tesla P100 and V100 GPUs.
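A minimal sketch of this setup in Keras follows; `build_unet` is a hypothetical stand-in for any U-Net implementation with the stated input shape, and restoring the best weights reflects our assumption that the best validation-loss model is kept.

```python
import numpy as np
import tensorflow as tf

def compile_baseline(n_bands):
    # n_bands = 3 for the simulated map, 4 for the RGB-NIR real-world maps
    model = build_unet(input_shape=(256, 256, n_bands))  # hypothetical builder
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy")
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# model.fit(train_batches, validation_data=val_batches, callbacks=[early_stop])

def iou(pred_mask, true_mask):
    # Evaluation metric: Intersection over Union of binary building masks
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return inter / union if union else 0.0
```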

V. RESULTS AND DISCUSSION
This section presents the results of the three methods (i.e., SFL-U-Net, Filtered-U-Net, and U-Net) in various scenarios, including simulated maps, the case when a clean dataset is given, and real-world noisy maps. The impact of the key parameters is investigated. Results with different numbers of training samples and with different architectures are also provided.

A. Performance of SFL: Simulated Map
There are three main advantages of using a simulated map. First, diverse types of noisy samples can be simulated. Second, the ratio of noisy samples throughout the entire training set is controllable. Finally, how accurately SFL filters noisy samples can be evaluated, as the noisy samples were artificially generated by adding noise to the original clean samples prior to training.
Taking advantage of these merits, we examined the semantic segmentation performance of SFL with different types of noise and different noise levels. Moreover, we assessed the filtering performance of SFL by introducing two additional metrics (i.e., Filtering-IoU and Filtering-Precision). The filtered-out sample set F (whose elements are the samples removed by SFL) and the noisy sample set W (whose elements are the noisy samples from the simulation) are prepared. Then, Filtering-IoU and Filtering-Precision can be computed by (5) and (6), respectively, where |A| denotes the cardinality of a set A

$$\text{Filtering-IoU} = \frac{|F \cap W|}{|F \cup W|} \qquad (5)$$

$$\text{Filtering-Precision} = \frac{|F \cap W|}{|F|} \qquad (6)$$

Filtering-IoU measures the general filtering performance of SFL, and Filtering-Precision calculates the ratio of noisy samples among the samples filtered out by SFL. For example, a low Filtering-Precision indicates that the filtered sample set includes a high ratio of clean samples. Both metrics range from 0 to 1, and higher values mean better filtering performance.
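These two metrics reduce to simple set operations; a sketch, assuming each sample carries an integer index:

```python
def filtering_metrics(filtered_ids, noisy_ids):
    # F: samples removed by SFL; W: samples tagged as noisy by the simulation
    F, W = set(filtered_ids), set(noisy_ids)
    filtering_iou = len(F & W) / len(F | W) if F | W else 0.0       # Eq. (5)
    filtering_precision = len(F & W) / len(F) if F else 0.0         # Eq. (6)
    return filtering_iou, filtering_precision
```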
1) SFL With Different Types of Noise: Four noisy maps, each containing only one of the four error types (i.e., omission, commission, boundary, and both omission and commission), were created. Each type of error constitutes 20% of the total samples, but it is worth noting that their difficulty levels are not exactly proportional to the ratio of noisy samples. Table I tabulates the results of the three methods for the four different types of noise. The three methods were also carried out on the original ground truth data, i.e., strong supervision. The reasons for the experiment with the original ground truth data are 1) to estimate the maximum obtainable IoU value, which is assumed to be an upper limit for SFL as well as WSL, and 2) to check whether SFL is harmful when the training set contains no noise at all.
For all types of errors, SFL-U-Net obtained higher accuracies than U-Net, indicating that SFL successfully improves the performance of the baseline model. Filtered-U-Net also obtained higher accuracies, as it makes U-Net start training with refined training samples. Although the difference in performance between SFL-U-Net and Filtered-U-Net was not significant, we can argue that SFL-U-Net is the better choice considering computation time, as Filtered-U-Net requires two model training processes. Considering the IoU values obtained when the model was trained with the original ground truth data, SFL-U-Net nearly reached the maximum obtainable IoU value, and this successful performance is supported by the high Filtering-IoU and Filtering-Precision.
Among the four types of errors, the performance of SFL-U-Net, as well as U-Net, was relatively poor for omission errors, especially compared with commission errors. The difference in performance between omission and commission might be attributed to the class imbalance problem, also known as prior probability shift [66]. Since the ratio of the building class was lower, the probability output for the building class tends to be underestimated compared with the nonbuilding class. In particular, as omission errors further decrease the building class ratio, they could make the model more difficult to train. Our experiments made no assumption about the distribution of building ratios and error types. However, if there is prior knowledge that omission errors dominate the noisy map, using a loss function that deals with class imbalance, such as weighted loss [67] and focal loss [68], or oversampling a minority class [69] might be an additional remedy to improve performance.
TABLE I: Comparison of SFL-U-Net, Filtered-U-Net, and U-Net with the four different error types.
TABLE II: Comparison of SFL-U-Net, Filtered-U-Net, and U-Net with the eight different noise levels.
Finally, the IoU values of SFL-U-Net and U-Net were not significantly different in the strong supervision case, as given in Table I. The marginal performance penalty despite applying SFL to strong supervision suggests that SFL is robust and will not significantly harm the baseline performance even if the training dataset is perfectly clean.
2) SFL With Different Noise Levels: Table II tabulates the comparison of the three methods at the eight different noise levels. The noise sample ratio represents the ratio of noisy samples in the entire training dataset. The four types of errors (i.e., omission, commission, boundary, and both omission and commission) evenly compose the noisy samples, as described in Section IV-A. The results of the strong supervision case in Table I are shown again as a noise sample ratio of 0% for convenience.
SFL-U-Net improved the baseline performance at all noise levels, including the case where noisy samples account for more than half of the training samples. Notably, the Filtering-IoU tends to decrease as the noise sample ratio increases, whereas the Filtering-Precision tends to increase. This opposite tendency of Filtering-IoU and Filtering-Precision suggests that "underfiltering" occurred as the noise sample ratio increased. The underfiltering might have occurred due to the upper limit on the number of removed samples determined by the parameter SL. As our experiment set SL to 0.3, SFL could not filter out more than 30% of the samples in the validation dataset, and consequently, SFL had no choice but to stop eliminating noisy samples from the validation dataset. Another reason for the underfiltering may be the model's overfitting to noisy samples before their removal.
The underfiltering issue might be alleviated by tuning the parameters of SFL. For example, a higher MFR can reduce the possibility of underfiltering as well as the possibility of the model overfitting to noisy samples, because a higher MFR eliminates noisy samples more rapidly. However, an excessively high MFR may cause "overfiltering," removing not only noisy samples but also clean samples. Better performance might be achieved through parameter tuning in consideration of the number of available training samples and the expected ratio of noisy samples. Section V-D provides more discussion on parameter tuning.
3) When Clean Samples Are Given: In the previous SFL experiments (whose results are given in Tables I and II), both training and validation sets were noisy. In fact, the validation set was sampled from the same noisy map as the training set. Although some samples must be clean, we assume that we do not know which ones are clean. In practice, however, there is a chance of having a small number of samples known to be clean and a large number of noisy samples. Thus, we assumed a situation where clean samples were given and used the clean set as the validation set. That is, we added simulation errors only to the training set and used hand-digitized labels for the validation set.
Tables III and IV provide the results when samples known to be clean (hand-digitized labels without simulated errors) are used as the validation set. For training samples, simulated noisy maps were used, the same as in the previous SFL experiments. All experimental setups are thus the same as in the previous cases for Tables I and II, except that the validation set is composed only of clean samples and the filtering is performed only on training samples. As a result, SFL-U-Net significantly improved the baseline performance in most cases, indicating that the clean validation set helps to find the optimal parameters of the deep model. The results suggest that SFL could also be useful when given clean samples are used as the validation set.

B. Performance of SFL: Real-World Noisy Maps
Raw OSM and MBF of both the Purdue University area and the Indianapolis center and their corresponding RGB-NIR orthoimages were collected as training datasets to evaluate the performance of SFL. Note that the training datasets of the real-world noisy maps contain no samples known to be clean. For evaluation, ground truth maps of randomly selected regions, separated from the training datasets, were used. SFL-U-Net, Filtered-U-Net, and U-Net were compared. Table V tabulates the IoU values, presenting the semantic labeling performance of the three methods on the two types of noisy maps and in the two areas. Overall, SFL-U-Net obtained the highest IoU among the three methods. Compared with Filtered-U-Net, SFL-U-Net obtained higher IoU values in all cases. In the case of the Purdue University area, Filtered-U-Net even obtained a lower IoU than the baseline. This suggests that the training samples remaining after the filtering may not have been sufficient to train the U-Net from scratch; at the same time, it suggests that SFL-U-Net could have benefited even from noisy samples and is less likely to fail in training due to a severe lack of training samples.
The Purdue University area was the only exception in which SFL-U-Net obtained a lower IoU than the baseline. One possible reason for this exception is the poor quality of the training label. This is because the filtering is based on the loss calculated by the deep model, and the credibility of the loss is affected by the performance of the deep model itself. As shown in Fig. 3, the OSM label of the Purdue University area contains many omission errors, and the spatial coverage of its annotation is quite heterogeneous. This might have caused the failure in U-Net training as well as in filtering. In the case of the same area with MBF, where the overall IoU values were significantly higher than with OSM, both SFL-U-Net and Filtered-U-Net showed significant improvements over the baseline. It implies that SFL is effective when at least a certain amount of training samples is ensured to train its deep model. This observation is also confirmed in Section V-E, which provides additional experimental results on the impact of training dataset size. A related finding is also reported in [23], which states that "training on noisy labels does work well, but only with substantially larger training sets." How many samples are needed will depend on the experimental area, the quality of the image, and the noise level.
To validate the assumption that sufficient training samples, albeit noisy, could be more beneficial to SFL-U-Net, the training samples were augmented by combining the training datasets of both the Purdue University area and the Indianapolis center. In other words, the training and validation samples of both areas were combined, and the test samples of each area were evaluated separately. The results are given in Table VI. SFL-U-Net obtained higher IoU values without any exception. Note that there was no additional labeling cost, as we used free OSM labels and simply let the deep model train on a larger area.

C. Visualization of the SFL Process and Results
Fig. 4 illustrates the learning process of SFL by showing how a noisy sample is filtered out and how the filtering affects the final model in comparison with the baseline U-Net. In Fig. 4, both U-Net and SFL-U-Net started with the same training samples of the simulated map but produced different outputs in the end. The training image, the training label, and the ground truth are shown together. Note that both models train with the noisy label, not the ground truth.
At epoch 2, both deep models were in the very early stage of learning. For SFL, the deep model was considered not yet capable of distinguishing noisy samples based on the model's convergence rate and the parameter CRT, so SFL-U-Net did not start filtering. At epoch 6 of SFL-U-Net, the sample was filtered out according to the ranking of loss, taking into account the criteria determined by CRT, MFR, and SL. In other words, SFL-U-Net no longer included the noise sample in its training, whereas U-Net kept it. In the end, by early stopping, the final models of SFL-U-Net and U-Net were determined at epochs 17 and 20, respectively. As shown, SFL-U-Net was able to predict building footprints properly by filtering out the noise label before the deep model was overfitted to the noise sample, despite never using the ground truth. In contrast, since the baseline U-Net did not filter out the noise sample, it ended up overfitted to the noisy training label.
The advantage of SFL was also clearly confirmed in the final building map outputs. As shown in Fig. 5, building maps from SFL-U-Net tend to have clear boundaries and fewer omission and commission errors. On the other hand, the baseline U-Net often produced wrong building footprints on roads and blurred building boundaries. These errors may have been caused by some poorly labeled samples that could confuse the deep model. As shown in Figs. 4 and 5, SFL prevents the deep model from being confused by noise labels and leads it to produce better results.

D. Suggestions for Parameter Selection
To help select proper parameters and understand their impact, the performance of SFL was investigated with various combinations of parameters. Fig. 6 shows the accuracy gains (%) of SFL-U-Net and Filtered-U-Net over the baseline U-Net according to changes in CRT, MFR, and SL, respectively. The accuracy gain was calculated as the rate of increase of the IoUs of SFL-U-Net and Filtered-U-Net over the IoU of the baseline U-Net with default parameter values (CRT, MFR, and SL are 0.2, 0.1, and 0.3, respectively). While changing each parameter, the other parameters were fixed to their default values. The simulated map with a 40% noise sample ratio was used for the experiment. Furthermore, the removed sample ratio (%) was obtained to observe how many samples were removed by SFL. Table VII tabulates the removed sample ratios when different hyperparameters were used in the experiments. The removed sample ratios of the training and validation samples were calculated separately. Note that SL applies only to the validation set, as described in Section III.
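For clarity, the accuracy gain reported in Fig. 6 can be expressed as follows (our formulation of the relative-increase definition above):

```python
def accuracy_gain(iou_variant, iou_baseline):
    # Relative IoU gain (%) of SFL-U-Net or Filtered-U-Net over the baseline
    # U-Net trained with the default parameters (CRT=0.2, MFR=0.1, SL=0.3).
    return 100.0 * (iou_variant - iou_baseline) / iou_baseline
```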
It was found that MFRs ranging from 0.05 to 0.45 significantly affected neither the performance nor the removed sample ratio. Both SFL-U-Net and Filtered-U-Net improved the baseline performance. MFR is the upper limit of the filtering rate per epoch and is not the only factor determining the filtering rate, as SFL linearly adjusts the filtering rate from 0 to MFR considering the convergence rate of the model and CRT. This indicates that the performance of SFL is not significantly affected by MFR, especially if both MFR and CRT are within reasonable ranges.
Experimental results revealed that SL was more influential on performance than MFR. As shown in Fig. 6, when SL was 0.1 or 0.2, the accuracy gain was about 5%, but when SL was 0.5 or 0.6, the accuracy gain was about 10%. Considering that the noise sample ratio was 40%, the results indicate that SFL performs better when SL is larger than the noise sample ratio. This is because SL values smaller than the noise sample ratio cause underfiltering. Table VII shows that the removed sample ratios were considerably smaller than the noise sample ratio (40%) when SL was 0.1-0.3, whereas the removed sample ratios were close to the noise sample ratio when SL was larger than 0.4. There were some cases where the removed sample ratio was larger than the noise sample ratio, which means that SFL might have removed more samples than necessary. Although the noise sample ratio is unknown in practical usage, the results indicate that selecting an SL value slightly larger than the expected noise sample ratio would be acceptable.
CRT was the most influential parameter on SFL performance. MFR, SL, and CRT are associated with each other and complementarily affect the filtering rule. However, CRT is the only parameter that considers the training dynamics of the model. SFL determines when to remove samples based on CRT. If CRT is too large, SFL can remove samples even if the model is not sufficiently trained. On the other hand, if CRT is too small, SFL may not have enough chances to filter samples. As given in Table VII, when CRT was larger than 0.3, the removed sample ratios of the validation set always reached SL (30%), and the removed sample ratios of the training set significantly exceeded the noise sample ratio. As SL limited the number of removed validation samples, SFL-U-Net was able to obtain higher IoU values than U-Net, as shown in Fig. 6. However, as Filtered-U-Net uses the remaining training samples, whose number can shrink without limit, its performance was significantly degraded when CRT was larger than 0.3. CRT serves to prevent SFL from filtering when the deep model is unstable. The results show that SFL can remove training samples excessively if CRT is too large (>0.3). Therefore, especially when SL must be large, small CRT values (0.1-0.2) are recommended.
Experiments with various parameter values showed how the parameters impact SFL performance in terms of accuracy and the number of filtered samples. As a result, an MFR within 0.05-0.45 can produce fine results if CRT is set properly. SL is recommended to be slightly larger than the noise sample ratio. A CRT no greater than 0.3 is recommended when an appropriate SL value is uncertain. It should be mentioned that the experiments were conducted with a known noise sample ratio and that the other parameters were fixed while changing each parameter. However, by observing the accuracy gain across the different hyperparameter values, we confirmed that SFL was able to improve the baseline performance in most of the parameter space.
To observe the performance of SFL with a more aggressive filtering setting, we repeated the same experimental design with different hyperparameter values. In the previous experiments, CRT, MFR, and SL were set to 0.2, 0.1, and 0.3 by default. In the aggressive setting, CRT, MFR, and SL were set to 0.3, 0.2, and 0.5, respectively; this parameter combination is referred to as the "aggressive filtering strategy." The increased CRT raises the chance that SFL performs filtering. The increased MFR allows SFL to filter out more samples at each epoch. The increased SL allows SFL to filter up to 50% of the training samples. As a result, we found that the aggressive filtering strategy gains particularly more accuracy at high noise sample ratios (>0.3) compared with the default parameter setting. However, when training samples are severely limited, aggressive filtering was not successful, as too-rapid filtering risks a shortage of training samples before the model learns the building representations properly. To sum up, although the best parameter combination varies depending on the condition of the given training samples, we found that SFL is robust when the parameters are within reasonable boundaries. Also, if the training labels are noisy but the number of samples is sufficient, the aggressive filtering strategy can obtain better accuracy.
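The two parameter presets compared above can be summarized as follows (a plain restatement of the values in the text):

```python
# Hyperparameter presets compared in Section V-D
DEFAULT_FILTERING = {"crt": 0.2, "mfr": 0.1, "sl": 0.3}
AGGRESSIVE_FILTERING = {"crt": 0.3, "mfr": 0.2, "sl": 0.5}  # ample but very noisy data
```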

E. Performance of SFL According to the Training Dataset Size
To explore the performance of SFL with more varied training dataset sizes, we added three more dataset sizes. Thus, a total of four dataset sizes (200, 400, 800, and 1600 tiles) were used to compare the performance of SFL-U-Net, Filtered-U-Net, and U-Net. A total of eight noise levels (0%-70% noise sample ratios) were used for each dataset size. The aggressive filtering strategy described in Section V-D was adopted for all cases, and all three methods were trained under the same experimental conditions, as illustrated in Section IV-C. The IoU results are shown in Fig. 7. Under these various conditions, the baseline U-Net showed a fairly large spread in its IoU values, ranging from 0.599 (200 tiles with a 70% noise sample ratio) to 0.855 (1600 tiles with a 0% noise sample ratio). Considering the marginal difference in IoU between 800 tiles (0.841) and 1600 tiles (0.855) at a 0% noise sample ratio, the results with 1600 tiles can be considered close to the maximum obtainable performance.
When 200 tiles were used, the improvement by SFL was not significant in many cases. The results indicate that SFL may not perform well when the quality and quantity of the training dataset are severely limited. In contrast, when 1600 tiles were used, SFL improved performance on all noisy datasets. Notably, the IoU difference between the 0% and 70% noise sample ratios decreased as the size of the training dataset increased. The IoU of U-Net decreased by 23.0%, 21.6%, 16.1%, and 10.5% when 200, 400, 800, and 1600 tiles were used, respectively. This means that the deep model tends to become more robust to label noise as the training dataset grows. Nevertheless, the performance degradation of the baseline U-Net due to noise always occurred, as also observed in [23], while SFL consistently helped to improve the performance of the deep model. The results show that SFL remains effective on noisy datasets even if the volume of the training dataset is fairly large.

F. Performance of SFL With Different Architectures
To confirm the robustness of SFL, additional experiments were conducted with four additional semantic segmentation networks (i.e., FCN [47], LinkNet [70], SegNet [71], and DeepLabv3+ [72]). For FCN, we used a standard FCN structure that uses VGG16 [73] as an encoder and upsamples to the original resolution without any skip connection, unlike U-Net. LinkNet with a ResNet18 [74] encoder was used. SegNet, which uses the pooling indices of a lighter VGG16 encoder, was used. Finally, DeepLabv3+ with an Xception [75] encoder was used. The aggressive filtering strategy was adopted for all noise sample ratios, and all four networks were trained under the same experimental conditions applied to the U-Net experiment, as described in Section IV-C. Table VIII tabulates the semantic segmentation performance (IoU) with the different architectures. Similar to the results with U-Net, SFL improved the baseline performance in general. SFL was particularly effective when the noise sample ratio was 20%-50%. When the noise sample ratio was 0%-10% or 60%-70%, no improvement was made or the improvement was not significant. Similar to the findings of Section V-E, the results indicate that SFL can enhance the model's performance unless the quality of the training dataset is severely low. SFL was particularly helpful when the model has high complexity (e.g., DeepLabv3+), as complex deep models are generally more susceptible to overfitting.
In addition to IoU, the number of training epochs (NTE) that results in the best performance on the validation set was investigated. As early stopping was used in our experiments, the NTE can be considered the epoch just before overfitting begins. Table IX tabulates the NTE results for the different architectures. The NTEs of the SFL- and baseline variants were not significantly different in most cases, unless the baseline overfitted very quickly and failed to produce satisfactory results (see the results for DeepLabv3+). This indicates that SFL prevents overfitting. The NTEs of the Filtered- variants were larger than the others in most cases. This is because the Filtered- variants use a relatively small number of refined training samples, a subset of the entire training dataset, and refined training samples can delay the onset of overfitting. Notably, NTE tends to decrease as the noise sample ratio increases, indicating that the more noisy samples in the training dataset, the sooner overfitting can occur.

G. Discussion of Results and Limitations
SFL performance was evaluated across various error types, noise levels, and types of noisy maps (i.e., the simulated map, OSM, and MBF), and it proved robust and effective in all of these settings, improving the baseline performance. Compared with Filtered-U-Net, SFL-U-Net, which performs learning and filtering simultaneously, was superior in both accuracy and computational efficiency. SFL-U-Net leads the deep model to progressively learn from refined samples in a multiround manner and selects the best state of the model during filtering, whereas Filtered-U-Net starts its training with a fixed, prefiltered sample set. It is important to note that not all removed samples are necessarily harmful to training: the filtering accuracy is not perfect, and some removed samples might have contained informative features even if their labels were partially noisy. This is why SFL-U-Net is more robust than Filtered-U-Net even though Filtered-U-Net can use a less noisy dataset. In addition, when clean samples are available, SFL can further improve the deep model's performance by simply using them as a validation set.
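The contrast between the two schemes can be made concrete with a short sketch of a single loss-ranking filtering step; the binary cross-entropy loss and keep ratio below are illustrative assumptions, not our exact implementation. SFL-U-Net repeats such a step each round with the updating model, whereas Filtered-U-Net would run it once and train on the fixed result.

    import torch
    import torch.nn.functional as F

    def filter_by_loss(model, dataset, filter_ratio):
        # Rank samples by per-sample loss and keep the low-loss fraction
        # as the refined training set for the next round.
        model.eval()
        scored = []
        with torch.no_grad():
            for idx, (image, label) in enumerate(dataset):
                logits = model(image.unsqueeze(0))
                loss = F.binary_cross_entropy_with_logits(logits, label.unsqueeze(0))
                scored.append((loss.item(), idx))
        scored.sort()                               # ascending: low loss first
        n_keep = int(len(scored) * (1.0 - filter_ratio))
        return [idx for _, idx in scored[:n_keep]]  # indices of retained samples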
Experiments with different parameter combinations confirmed the impact of each parameter and provided reasonable bounds for parameter values. We also found that the aggressive filtering strategy can enhance the model's performance by preventing underfiltering when the proportion of noisy samples is large, although it increases the risk of eliminating clean-but-difficult samples. Prior knowledge, such as the number of available training samples and the approximate proportion of noisy samples, can help with parameter selection and further enhance the performance of SFL. For example, if the available open map contains abundant but very noisy labels, aggressive filtering might lead to more satisfactory results.
SFL adjusts its filtering rate adaptively, considering both the convergence rate and the initial hyperparameter values. Although SFL was not sensitive to parameter selection as long as the parameters were within the recommended ranges, careful consideration is required when training samples are scarce. SFL has the advantage that its filtering rule is simple and does not require heavy additional computation. However, more complex sample filtering criteria originally proposed for classification tasks, such as monitoring learning dynamics [54] or utilizing an auxiliary network [52], [53], could be considered in future work. For example, considering the loss dynamics of each individual sample could provide a more sophisticated per-sample criterion than the current rank-based filtering. In addition, SFL's performance might be limited in cases where the loss ranking is significantly influenced by the difficulty of specific classes (e.g., multiclass land cover mapping) rather than by the presence of labeling errors. Finally, SFL may not be effective when the available training labels are severely limited, as shown in Fig. 7, suggesting that refining existing labels or preparing a pretrained model could be a beneficial prior step before applying SFL to noisy building labels.
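As an illustration of the loss-dynamics idea, the sketch below scores each sample by the slope of its loss curve across epochs, in the spirit of [54]; this criterion is a hypothetical alternative to SFL's rank-based filtering, not part of SFL itself.

    import numpy as np

    def dynamics_score(loss_history: np.ndarray) -> np.ndarray:
        # loss_history has shape (n_epochs, n_samples): one loss curve per sample.
        # Clean samples typically show steadily decreasing loss, whereas noisy
        # samples tend to stay high or decrease late in training.
        epochs = np.arange(loss_history.shape[0])
        slopes = np.polyfit(epochs, loss_history, deg=1)[0]  # per-sample trend
        # Samples with near-zero or positive slope are candidates for removal.
        return slopes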
We note that the scope of our research was not to find the best-performing algorithm for building extraction but to contribute to a different line of effort: utilizing noisy building labels based on the simple idea that not all training labels are helpful. With this goal in mind, we demonstrated the robustness of SFL on a wide variety of noisy building labels and provided detailed suggestions for parameter selection. However, a performance gap remained between SFL and the strongly supervised case. Although not investigated in this study, integrating SFL with label refinement [41], [42], [43], pretrained models [11], [23], or semisupervised methods [12], [13], [14] may further close this gap in practical use.

VI. CONCLUSION
This article presented a simple yet effective learning method, entitled SFL, that improves the building extraction performance of a deep model in remote sensing images under noisy supervision. The key idea of SFL is that not all building labels improve the performance of a deep model. SFL lets the deep model simultaneously filter and learn from the training samples based on the loss of each sample, so that the model concentrates on conducive samples while alleviating the negative impact of noisy samples. Experiments confirmed that SFL is effective in diverse label noise scenarios.
The scope of this study was to validate the effectiveness of SFL for building extraction in remote sensing in the presence of label noise. Extensive experiments with the simulated map and real-world noisy maps demonstrated that SFL can improve the performance of the deep model across diverse error types and noise levels. The performance of SFL was also evaluated with multiple deep architectures and various training dataset sizes, and the impact of the parameters was investigated. Based on the results, we conclude that SFL can improve the performance of deep learning for building extraction unless the quantity or quality of the given training dataset is severely limited.
Although SFL was designed for building extraction tasks, its core idea of filtering samples based on their losses could be extended to other semantic segmentation tasks, especially where label quality varies significantly across regions. Road extraction, tree extraction, parcel mapping, and other binary semantic segmentation tasks are examples to which SFL could be applied in future work.
We expect our work to contribute to reducing annotation costs as well as to exploiting the potential of noisy but inexpensive open labels. In addition, as SFL is compatible with previous efforts to address label noise (i.e., label-refining methods and noise-robust architectures), we hope that SFL can serve as an additional, complementary method for dealing with noisy maps rather than as a replacement for previous methods. In the future, we plan to extend our findings to other semantic segmentation tasks with different data sources and to develop a continuously updated map utilizing open data.