The Outcome of the 2021 IEEE GRSS Data Fusion Contest - Track DSE: Detection of Settlements Without Electricity

In this article, we elaborate on the scientific outcomes of the 2021 Data Fusion Contest (DFC2021), which was organized by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society, on the subject of geospatial artificial intelligence for social good. The ultimate objective of the contest was to model the state and changes of artificial and natural environments from multimodal and multitemporal remotely sensed data towards sustainable developments. DFC2021 consisted of two challenge tracks: Detection of settlements without electricity (DSE) and multitemporal semantic change detection. We focus here on the outcome of the DSE track. This article presents the corresponding approaches and reports the results of the best-performing methods during the contest.


I. INTRODUCTION
THE 2021 Data Fusion Contest (DFC2021), organized by the Image Analysis and Data Fusion Technical Committee (IADF TC) of the IEEE Geoscience and Remote Sensing Society (GRSS), promotes interdisciplinary research on geospatial artificial intelligence (AI) for social good [1]. IADF TC is an international network of scientists working on earth observation, geospatial data fusion, and algorithms for image analysis. It aims at connecting people and resources, educating students and professionals, and promoting theoretical advances and best practices in image analysis and data fusion.
Since 2006, the IADF TC has organized an annual challenge, the DFC, to foster ideas and progress in remote sensing, distribute novel data, and benchmark analysis methods [2]-[16]. The contest pursues the ultimate goal of building models to understand the state and changes of artificial and natural environments using multisensor and multitemporal remote sensing data towards sustainable developments. The contest is designed as a benchmark competition following previous editions [13], [14], [16]-[18].
DFC2021 is dedicated to the following real-world challenges: 1) analysis of multisensor, multiresolution, and multitemporal data, and 2) learning from low-quality labeled samples (weak supervision). These problems are major open challenges in a wide range of fields, from Earth observation to computer vision and machine learning [1]. The main feature of the contest is that it directly tackles some of the most challenging social problems, such as energy equality and environmental conservation. In other words, the results of the contest will not only lead to technological development, but also yield tools for solving actual social problems [1]. DFC2021 includes the following two tracks, which were run in parallel.
1) Track DSE: Detection of settlements without electricity;
2) Track MSD: Multitemporal semantic change detection.
This article presents the main outcome of Track DSE. This track was co-organized by Hewlett Packard Enterprise, SolarAid, and Data Science Experts, to address the automatic detection of human settlements deprived of access to electricity using multimodal, multiresolution, and multitemporal satellite remote sensing data. Sentinel-1 SAR data, Sentinel-2, Landsat-8, and Suomi Visible Infrared Imaging Radiometer Suite (VIIRS) night time images were utilized as input (see Fig. 1). While the original ground sampling distance (GSD) ranges from 10 m to 750 m, all images were resampled at 10 m. Four classes of semantic labels (i.e., settlements with and without electricity, no settlements with and without electricity) were provided at a GSD of 500 m for the training set. Participants were required to submit binary classification maps of settlements without electricity at a GSD of 500 m. The classification accuracy of the different solutions was evaluated using the F1 score. The main challenge of Track DSE is to develop robust and efficient methods to extract high-level semantic information from such heterogeneous data.
In this article, we describe the datasets used in DFC2021 in Section II, and discuss the overall results of the competition in Section III. Then, we focus in more detail on the approaches proposed by the first-ranked teams of Track DSE in Section IV and Section V. Finally, Section VI concludes this article.

II. DATA AND BASELINE OF THE DATA FUSION CONTEST 2021
The data of the DFC2021 mapped the Dezda and Salima sectors in Malawi. It has been split into 98 tiles of 800 × 800 pixels. The tiles were distributed across the training, validation, and test sets as 60, 19, and 19 tiles, respectively. Each tile included 98 channels from the satellite images listed below. All the images have been resampled to a GSD of 10 m. Thus, each tile corresponded to a 64 km² area.
1) Sentinel-1 polarimetric SAR: Two channels corresponding to intensity values for VV and VH polarization at a 5 m ×

III. ORGANIZATION, SUBMISSIONS AND RESULTS
There were 168 unique registrations at the CodaLab competition website during the development phase, and 28 teams entered the test phase after screening the descriptions of their approaches submitted by the end of the development phase. In total, 3456 submissions were received during the development phase with active participation from all registered teams. During the test phase, the maximum number of submissions per team was limited to ten, and 170 submissions were received. The final ranking was determined based on the F1 score.
TABLE I. TOP RANKED TEAMS AND THEIR APPROACHES IN TERMS OF USED DATA, PROBLEM SETTING, AND CLASSIFICATION MODELS
The first to fourth ranked teams were awarded as winners of the DFC2021 Track DSE and presented their solutions during the 2021 IEEE International Geoscience and Remote Sensing Symposium. The four winning teams are the following.  [24]. The top four teams commonly employed techniques such as preprocessing, including data normalization and data augmentation, ensembling (or model fusion) to integrate the results of multiple models, and test-time augmentation. All teams utilized recent neural network architectures (or modules/layers), which have been shown to be effective in the field of computer vision, for classification and built models that were well customized to the task of the contest. Since multisensor data were provided and the task was more complex than typical building detection, various approaches were developed in terms of which data to use and what problem setting to solve. Table I summarizes the differences in the approaches of the top four teams regarding used data, problem setting, and classification models. While all teams used VIIRS data for electricity detection, the use of Sentinel-1, Sentinel-2, and Landsat-8 data for settlement detection was different: three teams used only Sentinel-2 data, while one team (dimartinot) utilized all data and showed its effectiveness in the ablation study [23]. For this contest, Sentinel-2 data alone were sufficient for settlement detection, but the use of multi-sensor data is worthy of further investigation in order to scale up more widely and improve generalization performance.
Interestingly, there was no consensus on whether the task should be solved as a single classification problem or as two classification problems, i.e., settlement detection and electricity detection; similar performance was achieved with both approaches. As for the classification models, all teams adopted CNN-based methods for processing the relatively high-resolution Sentinel-2 images. On the other hand, different methods such as thresholding, random forest, and CNN were used for electricity detection when the dual-task setting was employed, suggesting that simple methods may be sufficient for processing low-resolution images.

IV. FIRST PLACE TEAM OF TRACK DSE
In this section, the winning algorithm of Track DSE is described. In order to detect settlements without electricity, we design dual-task models, i.e., we divide the task into settlement detection and electricity detection, as shown in Fig. 2(a). Specifically, we first preprocess the images and propose a novel outlier removal method for the VIIRS night time dataset. Next, SENet [25] is used to train multiple classifiers for the two detection tasks. Then, the classification results of the two tasks are postprocessed separately, which includes model integration and correction of misclassification based on expert priors. Finally, the results of settlement detection and electricity detection are fused to obtain the final detection result. The experiments show that our preprocessing method filters out much of the interference and noise in the images, allowing the dual-task models to make full use of the valuable information in the multisource data, and that the postprocessing approach further uses prior knowledge to effectively correct misclassifications. Together, these steps contribute to the outstanding performance of our model.

A. Data Preprocessing
Data preprocessing is an essential part of our method and plays a crucial role in the performance of the subsequent detection models, especially for the electricity detection. It consists of two important steps.
1) Considering that noise in the data affects the performance of the model, outliers of the multisource images need to be removed.
2) Due to the small number of training samples, the true distribution of the data is not well reflected and the feature diversity is not sufficient, so we use data augmentation to enrich the features.
Sentinel-2 Multispectral Dataset: We use channels 4, 3, and 2 (red, green, and blue) to synthesize the true color images. To remove extreme outliers, the pixel values of the three channels are clipped to the interval [0,2000] [26], and then the synthesized images are normalized to [0,255]. Since the ground sampling distance of the true color image is 10 m × 10 m and that of the ground truth is 500 m × 500 m, for each 800 × 800 pixels image it is necessary to decide whether there is a human settlement in each 50 × 50 pixels region. Therefore, each true color image is cropped into 256 smaller images of 50 × 50 pixels, which are classified into the two categories "settlements" and "no settlements" based on the labels. This yields a training set that can be used to train a model to classify the presence or absence of settlements in the smaller images. To enrich the features, the training set is augmented with five data augmentation methods: rotation by 90, 180, and 270 degrees, horizontal flip, and vertical flip.
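The Sentinel-2 preparation described above can be illustrated with a minimal sketch. It is not the team's code: it assumes a single channel stored as nested Python lists to stay dependency-free, and the function names (`clip_and_scale`, `crop_regions`, `augment`) are hypothetical.

```python
def clip_and_scale(value, lo=0, hi=2000):
    """Clip a raw pixel value to [lo, hi] and rescale it to [0, 255]."""
    clipped = min(max(value, lo), hi)
    return round(255 * (clipped - lo) / (hi - lo))


def crop_regions(image, size=50):
    """Split a square 2-D image into non-overlapping size x size regions (row-major)."""
    n = len(image)
    regions = []
    for top in range(0, n, size):
        for left in range(0, n, size):
            regions.append([row[left:left + size] for row in image[top:top + size]])
    return regions


def rot90(img):
    """Rotate a 2-D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the five augmented copies: three rotations and two flips."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    hflip = [row[::-1] for row in img]        # mirror left-right
    vflip = [list(row) for row in img[::-1]]  # mirror top-bottom
    return [r90, r180, r270, hflip, vflip]


# An 800 x 800 tile yields 16 x 16 = 256 regions of 50 x 50 pixels.
tile = [[0] * 800 for _ in range(800)]
regions = crop_regions(tile)
```

With the original image included, each region contributes six training samples under this scheme.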
VIIRS Dataset: To obtain cleaner data, we remove the noise in each channel using the proposed formula (1), where pixel^c_{i,j} denotes the pixel value at position (i, j) of the image, c denotes the name of the channel, and i, j ∈ [1, 800]. All pixel values of the 60 training images from the same channel c are sorted from smallest to largest, and L^c_min and L^c_max denote the pixel values located at 5% and 95% of the sorted sequence, respectively. Fig. 2(b) shows the pixel changes after removing the outliers, using one channel as an example.
For each tile of the training set, we sum the pixel values of the 9 images after removing outliers and normalize them to the interval [0,255], thereby yielding one night time image corresponding to the tile. Since the ground truth is a coarse label of 16 × 16 pixels, we use average pooling over 50 × 50 pixels instead of max-pooling to generate a new image of 16 × 16 pixels. Max-pooling is avoided because it would overly expand the original lit areas and make false detections more likely. The images are also augmented using the five data augmentation methods described previously to obtain a night time dataset for training a model to classify the presence or absence of electricity in an image.
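The VIIRS steps can be sketched as follows. Since the outlier formula itself is not reproduced in this text, the sketch assumes that values outside the 5% and 95% bounds are clipped to those bounds; that choice, the nested-list image layout, and all function names are illustrative assumptions.

```python
def percentile_bounds(values, low=0.05, high=0.95):
    """Return the values located at the low/high fractions of the sorted list."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(low * last)], ordered[round(high * last)]


def clip(value, lo, hi):
    """Assumed outlier handling: clamp the value to [lo, hi]."""
    return min(max(value, lo), hi)


def night_image(channels, bounds):
    """Sum the clipped channels pixel-wise and rescale the sum to [0, 255]."""
    rows, cols = len(channels[0]), len(channels[0][0])
    total = [[sum(clip(ch[r][c], lo, hi) for ch, (lo, hi) in zip(channels, bounds))
              for c in range(cols)] for r in range(rows)]
    flat = [v for row in total for v in row]
    vmin, vmax = min(flat), max(flat)
    span = (vmax - vmin) or 1  # avoid division by zero on constant images
    return [[255 * (v - vmin) / span for v in row] for row in total]


def average_pool(image, size=50):
    """Average non-overlapping size x size blocks (800 x 800 -> 16 x 16 for size=50)."""
    n = len(image)
    pooled = []
    for top in range(0, n, size):
        pooled_row = []
        for left in range(0, n, size):
            block = [image[r][c]
                     for r in range(top, top + size)
                     for c in range(left, left + size)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled
```

Averaging keeps a single bright pixel from lighting up a whole 500 m cell, which is the motivation given above for preferring it over max-pooling.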
Taking tile14 as an example, Fig. 2(c) shows the differences in images due to outlier removal on the two datasets, respectively. It can be seen that our outlier removal method is critical, especially for the electricity detection, which filters out the interference of noise to make the regions with electricity more accurate.

B. Dual-Task Models
Transfer Learning: After preprocessing, two datasets are available for settlement detection and electricity detection, respectively. For both classification tasks, we use SENet154 pretrained on ImageNet for transfer learning. The channel attention mechanism allows the network to learn the weights of channels adaptively, assigning different weights to different channels. Multiple trained models can be obtained by setting different hyperparameters. The base settings are a batch size of 32, 15 epochs, and an initial learning rate of 0.001; the learning rate decays to 90% of its previous value after each epoch during training.
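As a small worked example of the base schedule, the sketch below assumes that the first epoch runs at the initial rate and that the factor 0.9 is applied after every epoch; `learning_rate` is a hypothetical helper, not part of any training framework.

```python
def learning_rate(epoch, initial=0.001, decay=0.9):
    """Learning rate used in the given epoch (0-based) under exponential decay."""
    return initial * decay ** epoch


# Rates for the 15 epochs of the base setting.
schedule = [learning_rate(epoch) for epoch in range(15)]
```

Under these assumptions the rate in the last epoch is 0.001 × 0.9^14, roughly 2.3 × 10^-4.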
Settlement model and electricity model: After training, multiple models for detecting settlements are obtained; they are called settlement models. In addition, we define a portion of 50 × 50 pixels in an image as a region, which means that an image consists of 256 regions. Regions with settlements are denoted as 1, and all others as 0. Similarly, we refer to the model used for electricity detection as the electricity model. It determines whether an image contains an area with electricity.

C. Postprocessing
To further improve the predictions, the postprocessing consists of model integration and the proposed correction of misclassification based on expert priors for settlement detection, and a threshold approach for electricity detection. The final regions with settlements but no electricity are obtained by fusing the results of settlement detection and electricity detection.
Settlement detection: The trained models have different preferences for different scenarios due to different settings of hyperparameters. In the testing phase, the predictions of the top-5 best performing classification models on the validation set are integrated. Model integration fuses the outputs of multiple models by voting to improve the settlement model performance.
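Model integration by voting can be sketched as below. The text only states that the outputs are fused by voting, so the strict-majority rule and the function name `majority_vote` are assumptions made for illustration.

```python
def majority_vote(predictions):
    """Fuse K binary maps (nested lists) by strict majority, region by region."""
    k = len(predictions)
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [[1 if 2 * sum(p[r][c] for p in predictions) > k else 0
             for c in range(cols)]
            for r in range(rows)]


# Five hypothetical settlement models voting on a 1 x 3 strip of regions.
models = [[[1, 0, 1]], [[1, 0, 0]], [[1, 1, 0]], [[0, 0, 1]], [[1, 0, 0]]]
fused = majority_vote(models)
```

With five models, a region is kept as a settlement when at least three models agree.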
Correction of misclassification based on expert priors: We use prior knowledge of the continuity of the building distribution to correct the predictions of the settlement model. In Fig. 3, (A) is a true color image, and (B) is a binary map of 16 × 16 pixels, which is the prediction result of (A). A 750 × 750 pixels image is obtained by center cropping (A), and (C) is the binary map of 15 × 15 pixels corresponding to this cropped image. Taking the region marked by the orange line in (A) as an example, it can be seen that the region is predicted to be 1 in (B) but 0 in (C), which means that the two predictions conflict. In this case, we check the pixel values of the 8 regions surrounding the conflicting region in (B); if they are all 0, the 1 in the region is corrected to 0. Otherwise, based on the prior knowledge, the region is not corrected.
Electricity detection: The electricity model is used to classify the presence or absence of areas with electricity. If there is an area with electricity, we use the threshold approach to select the specific regions with electricity in an image. The presence of electricity is denoted as 1 and the absence as 0. First, when an area consisting of multiple regions does not contain any independent regions, i.e., when the regions are adjacent to each other, the area is called a contiguous area. Next, if the number of regions contained in a contiguous area is not less than 4, the area is defined as a valid contiguous area; otherwise, it is an invalid contiguous area. Invalid contiguous areas are considered noisy data to be removed. Then, the pixel values in an image are sorted from smallest to largest and the pixel value located at 85% is taken as the initial threshold. Regions with pixel values greater than the threshold are set to 1 and those less than it are set to 0. Finally, for each tile, a binary map of 16 × 16 pixels is obtained. Experience shows that the best performance is achieved when the final threshold is the pixel value located at 85-90%, and the binary map after removing invalid contiguous areas represents the electricity detection result.
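The two rule-based steps of this subsection can be sketched as follows. Several details are assumptions: adjacency is taken as 4-connectivity, neighbors outside the map are ignored, and the set of conflicting regions is passed in as a precomputed list because the exact matching between the 16 × 16 and 15 × 15 maps is not specified here. All function names are hypothetical.

```python
from collections import deque


def threshold_map(pooled, fraction=0.85):
    """Binarize a pooled night time map at the value located at `fraction`."""
    ordered = sorted(v for row in pooled for v in row)
    threshold = ordered[round(fraction * (len(ordered) - 1))]
    return [[1 if v > threshold else 0 for v in row] for row in pooled]


def remove_small_areas(binary, min_regions=4):
    """Zero out contiguous areas that contain fewer than min_regions regions."""
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one contiguous area.
            area, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(area) < min_regions:
                for y, x in area:
                    out[y][x] = 0
    return out


def correct_conflicts(pred, conflicts):
    """Reset a conflicting positive region to 0 when all its neighbors are 0."""
    rows, cols = len(pred), len(pred[0])
    out = [row[:] for row in pred]
    for r, c in conflicts:
        neighbours = [pred[y][x]
                      for y in range(max(r - 1, 0), min(r + 2, rows))
                      for x in range(max(c - 1, 0), min(c + 2, cols))
                      if (y, x) != (r, c)]
        if pred[r][c] == 1 and not any(neighbours):
            out[r][c] = 0
    return out
```

Raising `fraction` toward 0.90 corresponds to the stricter end of the 85-90% range reported above.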
Model fusion: In order to detect settlements without electricity, the results of settlement detection and electricity detection need to be fused. The fusion strategy is shown in Fig. 4.

D. Results and Discussion
Settlement model: Different networks [27], [28] are trained on the Sentinel-2 dataset. Table II shows the performance comparison of multiple models and operations on the validation set, and it can be seen that SENet154 performs the best, so it is still used in the test set. In addition, the data preprocessing method can significantly improve the settlement model performance.
Electricity model: Based on the performance of the different models in settlement detection, we continue to use SENet154 as the backbone architecture of the electricity model. Only two images in the validation set are classified as containing electricity, which is far fewer than the number of images with electricity in the test set. Nevertheless, Table II also shows the remarkable role of the electricity model. Table III shows the improvement in F1 scores on the test set obtained by adding the various operations. When using only a single settlement model for prediction, F1 is 84.14%, and after integrating the results of five settlement models, F1 reaches 85.30%. Fusing the integrated results with those of the electricity model raises F1 to 88.64%, and after postprocessing, F1 reaches 89.39%. It can be seen that electricity detection is crucial and improves the F1 score by about 3 percentage points (from 85.30% to 88.64%), while the postprocessing based on expert priors adds a further gain.

V. SECOND PLACE TEAM OF TRACK DSE

A. Proposed Framework
The flowchart of our method is shown in Fig. 5. The method mainly consists of three steps, which can be summarized as follows.

1) Data Preprocessing: The steps of data and band selection, data noise removal, and data augmentation are implemented to process the multimodal and multitemporal remote sensing data, so as to obtain a reliable and sufficiently large training dataset.
2) Multimodel Fusion: A single-task model is built to detect settlements without electricity directly, and a dual-task model is built to detect settlements and electricity individually. A strategy for integrating the different models is then applied to further improve the result.
3) Postprocessing: In order to further eliminate the interference of noise, the threshold segmentation methods are adopted for producing the final result.
The specific operations of each step and experimental results will be introduced in the following sections.

B. Data Preprocessing
The data preprocessing step is quite essential in the whole framework, for the reason that data selection lays the foundation for the subsequent results. In this section, three parts are included: data and band selection, data noise removal and data augmentation, which aim at removing degraded data and retaining valid data.
The first step is to select useful data and bands for the detection of settlements without electricity. Optical imagery is used for the detection of settlements because it provides abundant spectral information. Landsat-8 and Sentinel-2 are both optical data, but the spatial resolution of Sentinel-2 is higher than that of Landsat-8, and the cloud contamination of the Landsat-8 data is more severe than that of the Sentinel-2 data. Thus, Sentinel-2 data are selected for the detection of settlements. Following commonly used building indices such as NDBI and UI [29], the classical bands for building detection, namely the red, green, blue, near-infrared, SWIR-1, and SWIR-2 bands of the Sentinel-2 image, are picked for the following steps. In addition, although SAR data can be effective for settlement detection thanks to its structural information, the provided Sentinel-1 data suffer from severe speckle noise that leads to poor results; thus, the Sentinel-1 data are discarded. In order to detect the electricity condition, the multitemporal VNP46A1 products are also chosen for the following parts.
The second preprocessing step aims to remove the data noise. Optical images are inevitably contaminated by clouds, thus, the cloud noise removal is of vital importance. A modified median filter is adopted in this part to composite cloud-free images from multitemporal data [30]. Furthermore, the VNP46A1 product data has light value variations at different times, and effective feature information should be extracted from multitemporal data to avoid noise interference. The accumulative, median and maximum values of multitemporal night time data are composited to generate new images. Moreover, a 2% linear gray scale stretching algorithm is then implemented on all selected and generated bands to eliminate extreme noisy points and enhance the image contrast.
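The night time compositing and the contrast stretch can be sketched as below. The sketch assumes that a "2% linear stretch" clips the lowest and highest 2% of values before rescaling to [0, 255], which is the common reading but is not spelled out here; the function names are hypothetical.

```python
from statistics import median


def temporal_composites(series):
    """Return (accumulated, median, maximum) composites of T co-registered images."""
    rows, cols = len(series[0]), len(series[0][0])

    def reduce_with(fn):
        return [[fn([img[r][c] for img in series]) for c in range(cols)]
                for r in range(rows)]

    return reduce_with(sum), reduce_with(median), reduce_with(max)


def linear_stretch(image, percent=2.0, out_max=255.0):
    """Clip to the [percent, 100 - percent] percentiles and rescale to [0, out_max]."""
    ordered = sorted(v for row in image for v in row)
    last = len(ordered) - 1
    lo = ordered[round(percent / 100 * last)]
    hi = ordered[round((1 - percent / 100) * last)]
    span = (hi - lo) or 1  # avoid division by zero on constant images
    return [[out_max * (min(max(v, lo), hi) - lo) / span for v in row]
            for row in image]


# Three acquisition dates of a 1 x 2 night time image.
nights = [[[1, 5]], [[3, 0]], [[2, 4]]]
accumulated, med, maximum = temporal_composites(nights)
```

The median composite is the least sensitive of the three to a single anomalously bright or dark night.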
Fig. 6. Workflow of single-task and dual-task model.
The final preprocessing step is data augmentation, which increases the amount of training data to avoid overfitting and to improve the generalization ability of the model. In this part, rotation, flipping, and rotation followed by flipping multiply the amount of data eightfold, providing the foundation for robust network training [31].
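One way to obtain an eightfold augmentation from rotations and flips is to combine the four 90-degree rotations with an optional horizontal flip. The exact set of transforms used by the team is not listed, so the sketch below is an assumption, and `eightfold` is a hypothetical name.

```python
def rot90(img):
    """Rotate a 2-D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2-D image left-right."""
    return [row[::-1] for row in img]


def eightfold(img):
    """Return the 8 variants: 4 rotations, each with and without a flip."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(hflip(current))
        current = rot90(current)
    return variants


variants = eightfold([[1, 2], [3, 4]])
```

For an image without symmetries, all eight variants are distinct, so no training sample is duplicated.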
The single task of DSE can be decomposed into two tasks: detection of settlements (DS) and detection of electricity (DE), so that both a single-task and a dual-task model can be formed. As shown in Fig. 6, an end-to-end global context convolutional neural network (GC-CNN) with the same architecture is designed and applied in both models. However, the inputs and outputs of the two models differ. The single-task model has only one GC-CNN, whose input is the concatenation of the 800 × 800 Sentinel-2 imagery and the night time data, and whose output is a 16 × 16 segmentation map of settlements without electricity. In the dual-task model, two GC-CNNs are adopted with different input data: the Sentinel-2 data and the original night time imagery. The outputs of the two networks are the segmentation maps of settlements and of electricity. The final result is obtained by subtracting the electricity map from the settlement map.
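The subtraction of the two dual-task outputs can be sketched as a per-cell operation on binary maps. The sketch assumes that negative differences (electricity detected where no settlement was found) are set to 0; the function name is hypothetical.

```python
def settlements_without_electricity(settlement_map, electricity_map):
    """Subtract the electricity map from the settlement map, clamping at 0."""
    return [[max(s - e, 0) for s, e in zip(s_row, e_row)]
            for s_row, e_row in zip(settlement_map, electricity_map)]


ds = [[1, 1, 0], [0, 1, 0]]  # settlement detection (DS)
de = [[0, 1, 1], [0, 0, 0]]  # electricity detection (DE)
dse = settlements_without_electricity(ds, de)
```

A cell is therefore positive only when a settlement is detected and no electricity is detected.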

C. Multimodel Fusion
In the GC-CNN, three basic modules are utilized: the convolution block, the pooling block, and the global context (GC) block [32]. Among them, the convolution block is used to extract deep features from the imagery, while the pooling block is applied to reduce the dimensionality of the information. The GC block plays a vital role in extracting the global context information in the image and capturing the long-distance dependencies of the deep neural network. The specific operations of the GC block adopted in this article are shown in Fig. 7. First, a global attention pooling operation is used for context modeling, and a bottleneck transformation is then applied to capture channel dependencies. Finally, the features are further fused by broadcast element-wise addition.
In addition, due to the imbalance of label categories, the focal loss function (2) is used in network training. In (2), p denotes the predicted result, and y represents the true label. α and γ are the weight parameters, which are set to 0.6 and 2, respectively [33]. It is worth noting that a voting strategy is applied in the testing phase: among the eight predicted results generated by different data augmentation methods, if more than two prediction results are labeled as 1, the pixel is marked as positive.
On the one hand, although the single-task model can directly produce the final result, the fusion of night time data and optical data may be insufficient. On the other hand, the dual-task model makes full use of the characteristics of the data, but errors may accumulate in the multitask prediction process. Therefore, an empirical fusion strategy is presented in Fig. 8. The first step of model fusion is to take the intersection of the two model results to produce a high-confidence label, but this result still contains many omissions and errors. On the basis of these results, the confidence map of the network and the results of DE can be further processed by the classic threshold segmentation method, so that some high-confidence residential areas without electricity can be added to the intersection result. In Fig. 8, E-num represents the number of powered pixels in each tile, while Voting-num represents the number of positive results in the voting strategy (DS-GC-CNN). Confi is the confidence value of each pixel in DSE-GC-CNN.
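The loss and the test-time voting rule can be illustrated with a short sketch. Because the loss equation is not reproduced in this text, the code assumes the standard α-balanced binary focal loss, with α weighting the positive class; that form, the clamping constant `eps`, and the function names are assumptions rather than the team's exact definition.

```python
import math


def focal_loss(p, y, alpha=0.6, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction p in (0, 1) and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # keep the logarithm finite
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)


def tta_vote(predictions, min_votes=3):
    """Mark a pixel positive when more than two of the predictions are 1."""
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [[1 if sum(p[r][c] for p in predictions) >= min_votes else 0
             for c in range(cols)]
            for r in range(rows)]


# Eight augmented predictions for a 1 x 2 map: 3 votes and 2 votes.
preds = [[[1, 1]], [[1, 1]], [[1, 0]], [[0, 0]],
         [[0, 0]], [[0, 0]], [[0, 0]], [[0, 0]]]
voted = tta_vote(preds)
```

The factor (1 − p)^γ shrinks the loss of confidently correct positives, so training focuses on the hard, rare class; a vote threshold of three out of eight favors recall over precision.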

D. Postprocessing
After the abovementioned steps, some noise points remain in the multimodel fusion results, which can be further refined by threshold segmentation. First, we suppose that when the number of settlements without electricity in a tile is less than three, the entire tile has no residential areas. In addition, when a tile has an E-num greater than 30, pixels with a maximum value between 18 and 30 and pixels with a total value greater than 30 are regarded as powered pixels and are excluded.
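The first postprocessing rule can be sketched as follows; only that rule is shown, since the second depends on per-pixel night time statistics not detailed here. The function name is hypothetical.

```python
def suppress_sparse_tile(tile_map, min_positives=3):
    """Clear a tile whose number of positive cells is below min_positives."""
    positives = sum(sum(row) for row in tile_map)
    if positives < min_positives:
        return [[0] * len(row) for row in tile_map]
    return [row[:] for row in tile_map]


sparse = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # two positives: cleared
dense = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]   # three positives: kept
```

The rule treats one or two isolated detections in a 16 × 16 tile as noise rather than as a settlement.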

E. Results and Analysis
The multitemporal and multimodal experimental data are provided by the organizer. The preprocessed training data is split with 80% used for model training and 20% used for model validation. The final performance of our proposed approach is evaluated on the test data of 19 tiles by F1 score, which is a comprehensive evaluation index that combines the precision and recall rate.
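For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from flattened binary maps; the function name and the convention of returning 0 when a denominator is zero are illustrative choices.

```python
def precision_recall_f1(pred, truth):
    """Return (precision, recall, F1) for two flat lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


scores = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

A precision below recall, as reported for both models in this section, indicates more false positives than false negatives.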
The visual detection results of the different models and stages are shown in Fig. 9, where the results of the intermediate processes and methods, including the single-task model, dual-task model, model fusion, and postprocessing, are compared. The corresponding quantitative results of the detection of settlements without electricity are displayed in Table IV. Several conclusions can be drawn by comparing the visual and quantitative results. First, in the results of both the single-task model and the dual-task model, the precision is lower than the recall, which indicates that overdetection exists in both models. In addition, although the F1 scores of the two models are close, their visual detection results are inconsistent. The recall of the dual-task model is higher than that of the single-task model, and the precision of the single-task model is higher than that of the dual-task model. Besides, after fusing the single-task and dual-task models, the F1 score rises and the recall and precision become closer to each other, demonstrating that the integration of different models works. Finally, the postprocessing step improves both the F1 score and the precision. Overall, the experimental results illustrate that our multimodel fusion framework effectively improves the final prediction results.

VI. CONCLUSION
Remote sensing and earth observation can and do play a key role in reaching the sustainable development goals defined by the UN. They are of direct relevance across multiple targets, goals, and indicators, among them target 7.1 to "ensure universal access to affordable, reliable, and modern energy services". Remote sensing makes one of its strongest contributions in the monitoring of remote areas: A multitude of space-borne sensors with various spectral properties provide imagery with a high spatial and spectral resolution. This data can be leveraged to automatically detect settlements as well as indications regarding their access to electricity.
In this article, we summarize Track DSE of the 2021 IEEE GRSS Data Fusion Contest, organized by the IEEE GRSS Image Analysis and Data Fusion Technical Committee, which was dedicated to exactly this real-world social challenge: To detect settlements without access to electricity. We describe the challenge of creating well-performing machine learning models for the detection of settlements without electricity based on multisensor imagery when only noisy, low-resolution labels are available for training. To this aim, Track DSE provided Sentinel-1 polarimetric SAR, Sentinel-2, and Landsat-8 multispectral data, as well as VIIRS night time data. The images are geographically distributed over two different sites and split into 98 tiles of 800 × 800 pixels that were further split into disjoint training, validation, and test sets. Additionally, reference data was provided, which consisted of carefully manually annotated semantic maps that indicate the presence or absence of human settlements and access to electricity. The very different image content (e.g., optical versus SAR), the differing sensor characteristics (e.g., large differences in spatial resolution), and the low-resolution labels made the contest challenging yet realistic. This is also reflected by the strategies the participants employed to address these challenges. The top ranking teams all heavily relied on data preprocessing such as normalization to handle the diversity of the input images. Data augmentation was used to increase the amount of training data, while ensemble learning, model fusion, and/or test-time data augmentation helped to mitigate the label noise caused by the low resolution of the reference data. Interestingly, at least within the context of this contest, the Sentinel-2 multispectral images were sufficient to detect settlements. While some participants decoupled settlement and electricity detection and fused the respective results, others addressed the two together. Both approaches yielded a similar performance. While deep neural networks were clearly the dominant machine learning method, other, more traditional approaches proved useful as well, including data normalization, simple thresholding, and Random Forests.
The four top ranked solutions of this track were presented at IGARSS 2021, while the two top ranking solutions are described in more detail in this article. As in previous years, the DFC2021 attracted global attention, with participants well distributed over the world and across different affiliations and career stages. This clearly illustrates the interest of the remote sensing and earth observation community in using the available tools and expertise to contribute to the social good. Furthermore, many of the contest participants were students, which shows that the data fusion contest reaches early career scientists and is used for educational purposes.
The data remains accessible after the DFC2021 on the globally accessible data platform IEEE DataPort to allow further research and contributions. The CodaLab evaluation server and its public leaderboard were reopened and made accessible from the contest website. Thus, anyone can submit prediction results, obtain performance statistics, compare with other users, and hopefully improve on the results presented in this article.
The work described in this article is a first step in the journey of helping SolarAid detect settlements without access to electricity at a large geographical scale. The effort made by the different contributors is an invaluable basis to support the collaboration involving SolarAid, Hewlett Packard Enterprise, and Data Science Experts. Additional work has already been started to automate the data preparation for detection at the level of a whole country. Besides, certain teams have expressed their interest in pursuing further experiments to help scale up the work to a continental level. Discussions are also ongoing to understand the optimal output layout of the algorithm for easy adoption in the field and for a truly operational service.
In addition to the societal impact, the DFC2021 Track DSE poses a very interesting challenge from a scientific point of view. Questions such as if and how multiple and diverse sensor data should be fused to improve predictions, how to process different spatial and spectral resolutions within one common framework, how to mitigate issues due to label noise or otherwise degraded reference data (e.g., lower resolutions), and how to deal with a limited amount of labeled data in general are far from being solved. Instead, all of these questions represent very active research directions where the future promises significant advancements. In this regard, the data of the DFC2021 provides a valuable benchmark dataset that can be used to evaluate all or only some of these aspects.

ACKNOWLEDGMENT
The IADF TC chairs would like to thank the IEEE GRSS for continuously supporting the annual Data Fusion Contest through funding and resources.