MultiScene: A Large-scale Dataset and Benchmark for Multi-scene Recognition in Single Aerial Images

Aerial scene recognition is a fundamental research problem in interpreting high-resolution aerial imagery. Over the past few years, most studies have focused on classifying an image into one scene category, while in real-world scenarios, a single image often contains multiple scenes. Therefore, in this paper, we investigate a more practical yet underexplored task -- multi-scene recognition in single images. To this end, we create a large-scale dataset, called MultiScene, composed of 100,000 unconstrained high-resolution aerial images. Considering that manually labeling such images is extremely arduous, we resort to low-cost annotations from crowdsourcing platforms, e.g., OpenStreetMap (OSM). However, OSM data might suffer from incompleteness and incorrectness, which introduce noise into image labels. To address this issue, we visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly-annotated images, named MultiScene-Clean. With it, we can develop and evaluate deep networks for multi-scene recognition using clean data. Moreover, we provide crowdsourced annotations of all images for the purpose of studying network learning with noisy labels. We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and learning from noisy labels for this task, respectively. To facilitate progress, we make our dataset and trained models available at https://gitlab.lrz.de/ai4eo/reasoning/multiscene.


I. INTRODUCTION
With the recent development of Earth observation techniques, massive aerial imagery is now accessible for a variety of applications, such as environmental monitoring [1]-[6], urban planning [7]-[12], land cover and land use mapping [13]-[16], and disaster assessment [17], [18]. As one of the crucial steps towards these applications, aerial scene recognition has been extensively studied in the remote sensing community. During the last few years, the emergence of deep convolutional neural networks (CNNs) has pushed ahead research in this field, and enormous achievements [19]-[26] have been obtained. Albeit successful, most existing scene classification studies only focus on a specific scenario, where an aerial image is assumed to include a single scene [27]-[34]. Basically, these studies regard aerial scene recognition as a single-label classification problem and learn models on well-cropped single-scene aerial images (see Fig. 1(a)). However, in practical applications, an aerial image often contains multiple scenes, as it is collected overhead and usually has a large coverage (cf. Fig. 1(b)). We also note that even in public single-scene aerial image datasets, the coexistence of multiple scenes in a single image is inevitable, especially in images covering large areas. For example, as shown in the bottom two images in Fig. 1(a), although they are assigned single scene labels according to their central/dominant scenes (i.e., river and train station), there actually exists more than one scene in each of them.

The work is jointly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. [ERC-2016-StG-714087], Acronym: So2Sat), by the Helmholtz Association through the Framework of Helmholtz AI (grant number: ZT-I-PF-5-01), Local Unit "Munich Unit @Aeronautics, Space and Transport (MASTr)", and Helmholtz Excellent Professorship "Data Science in Earth Observation - Big Data Fusion for Urban Research" (grant number: W2-W3-100), and by the German Federal Ministry of Education and Research (BMBF) in the framework of the international future AI lab "AI4EO - Artificial Intelligence for Earth Observation: Reasoning, Uncertainties, Ethics and Beyond" (grant number: 01DD20001). (Corresponding authors: Lichao Mou and Xiao Xiang Zhu.) Y. Hua, L. Mou, and X. X. Zhu are with the Remote Sensing Technology Institute, German Aerospace Center, 82234 Weßling, Germany, and also with the Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany (e-mails: yuansheng.hua@dlr.de; lichao.mou@dlr.de; xiaoxiang.zhu@dlr.de). P. Jin is with the Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany (e-mail: pu.jin@tum.de).
Hence, in this paper, we aim to tackle a more realistic yet challenging problem, namely multi-scene recognition in single aerial images. This task refers to assigning an aerial image multiple scene labels, and there are no constraints on image preparation, such as centering dominant scenes and eliminating cluttered scenes. Compared to the conventional scene recognition task, multi-scene recognition is more arduous because 1) images are large-scale and unconstrained, and 2) all scenes present in an aerial image need to be exhaustively recognized. Fig. 1(b) shows an example of a multi-scene aerial image and its corresponding multiple scene-level labels. We can see that not only dominant scenes (e.g., residential and woodland) but also trivial scenes (e.g., bridge and parking lot) are annotated, which draws a more comprehensive picture of the unconstrained image.
However, very few efforts have been devoted to this problem in the remote sensing community. In order to advance the progress of multi-scene recognition in single images, we propose a large-scale Multi-Scene recognition (MultiScene) dataset, where 100,000 aerial images are collected around the world. In the phase of data preparation, we note that although massive high-resolution aerial images can be effortlessly obtained from remote sensing data platforms, such as Google Earth, it is extremely time- and labor-consuming to yield their corresponding multiple scene labels. To alleviate such annotation burden, in this paper, we resort to crowdsourced data, e.g., OpenStreetMap (OSM) annotations, which have been proven to be successful in generating image-level labels [27], [28], [35] and pixel-wise footprints [12], [36] for training deep networks. However, we observe that OSM data might suffer from two common defects, incompleteness and incorrectness, which could introduce severe noise into image labels. Fig. 2 shows examples of incomplete and incorrect OSM annotations, where in (a) sparse shrubs are neglected, and in (b) a tennis court is mislabeled as residential. With this in mind, here we do not directly use crowdsourced labels as ground truth data. Instead, we visually inspect 14,000 images and correct their labels, producing a subset of cleanly-labeled images, named MultiScene-Clean. It allows developing and evaluating deep networks for unconstrained multi-scene recognition using clean data. Moreover, we note that the noisy crowdsourced data are not completely useless; for example, they can be used to study network learning with noisy labels for this task. Therefore, we also provide crowdsourced annotations of all images.
The contributions of this paper are four-fold:
• Unlike conventional aerial scene recognition, where all images are well-cropped and each of them contains only one scene-level label, in this paper, we explore a more practical task -- multi-scene recognition in single images.
• We propose a large-scale dataset, namely MultiScene, consisting of 100,000 unconstrained multi-scene aerial images, each assigned OSM labels. We visually inspect 14,000 images and correct their labels, yielding a subset of cleanly-labeled images.
• The proposed dataset provides not only ground truth data but also crowdsourced labels, which enables research in learning from enormous noisy labels for our task.
• We extensively evaluate commonly-used classification networks on both MultiScene-Clean and MultiScene and provide benchmarks for recognizing multiple scenes in single images and learning from noisy labels for this task, respectively.
The remaining sections of this paper are organized as follows. Section II reviews studies on aerial single-scene classification and multi-label object classification. Section III briefly recalls existing scene datasets and delineates the proposed dataset. Experimental configurations and results are presented in Section IV, and Section V draws a conclusion.

II. RELATED WORK
This section briefly reviews related work in two fields: aerial single-scene classification and multi-label object recognition.

A. Aerial Single-scene Classification
Aerial single-scene classification refers to categorizing an aerial image into a single scene class. Early studies propose to construct scene representations with various low-level features, e.g., local structures [41], [42], color attributes [43], [44], and texture information [45], [46]. Since low-level features fail to comprehensively depict complex scenes, mid-level algorithms, such as Bag-of-Visual-Words (BoVW) [47], [48] and topic models [49], [50], are devised to encode local features (so-called "visual words") into more holistic mid-level scene representations for the classification task. However, these methods show limited performance in recognizing scenes of high diversity due to their dependency on hand-crafted features. Recently, the emergence of deep CNNs has brought immense advancements to the community, and many achievements [19]-[34] have been obtained in the field of aerial single-scene classification. These deep networks have hierarchical architectures, where convolutional and max-pooling layers are periodically interleaved for learning high-level features of intricate scenes. With layers going deeper, the learned features are more abstract and supposed to contain richer semantic information, which is crucial for judicious decisions. A popular trend of deep learning algorithms in single-scene classification is to take a CNN as the backbone and introduce well-designed modules for further enhancing the feature efficiency. For instance, Bi et al. [31] propose to learn multiple instances from feature maps extracted by a densely-connected CNN and integrate them into bag-level features for single-scene classification. Li et al. [51] propose a key region capturing method to learn class-specific features and retain global information for inferring scene labels. To leverage features of various levels, feature aggregation plays a key role in single-scene classification. Lu et al. [52] fuse features learned by the last three blocks and the second fully-connected layer of VGG-16, and Cao et al. [53] design a non-parametric self-attention layer to enhance spatial and channel responses of fused features for the final prediction. In [20], the authors develop a gated bidirectional network for aggregating features extracted by different convolutional layers with a gated function in both top-down and bottom-up directions. Besides, exploiting supplementary data, such as geo-tagged audios and multi-temporal images, has become a new research direction. Hu et al. [19] propose to predict scene categories by transferring sound event knowledge learned from sound-image pairs. In [25], the authors propose a two-branch network to learn deep features of bi-temporal images and fuse them through a CorrFusion module for aerial scene classification. Our literature review demonstrates that most existing studies assume that an aerial image includes only one scene and focus on well-cropped single-scene aerial images. Hence, these studies tend to regard entities present in an image as compositions of one scene, while in multi-scene recognition, this assumption would trigger networks to learn erroneous feature representations. However, very few efforts have been devoted to exploring multi-scene recognition in the remote sensing community.

B. Multi-label Object Classification
Multi-label object classification refers to assigning an aerial image multiple object-level labels, such as car, tree, and building. Similar to our work, these studies aim to provide a holistic understanding of aerial images, but from the perspective of objects. Early attempts [54], [55] follow the idea of simply combining a deep CNN with a post-processing approach for identifying multiple objects in an aerial image. In [54], the authors feed outputs of a CNN into a customized thresholding operation for inferring multiple object labels, while in [55], a conditional random field (CRF) is utilized as the post-processing model. In recent literature, more efforts are devoted to endowing deep neural networks with the capacity of reasoning about relations among various objects for more accurate predictions. In [56], the authors propose an end-to-end network comprising a CNN and a long short-term memory (LSTM) network that is responsible for modeling label dependencies through its recurrent units for multi-label object classification. [57] exploits a bidirectional LSTM network to learn spatial relations among all patches in an image for the final prediction. In [58], the authors propose a relational reasoning network module to model label dependencies and achieve better classification results. Instead of encoding label relations, [59] divides an aerial image into several patches of the same size and models spatial relationships among them for multi-label object interpretation. Compared to these studies, our task is more challenging, because the concept of a scene is more abstract and intricate than that of an object.

III. MULTISCENE DATASET FOR MULTI-SCENE RECOGNITION IN SINGLE AERIAL IMAGES
This section first reviews existing single-scene aerial image datasets and then delineates the proposed dataset.

A. Existing Single-scene Aerial Image Datasets
During the last decades, various aerial image datasets have been published for single-scene classification, and here we briefly review several commonly used ones.
• UC-Merced [37]: The UC-Merced dataset is composed of 2,100 images collected from the United States Geological Survey (USGS) National Map, and each of them is categorized into one of 21 scene classes: overpass, golf course, river, harbor, beach, building, airplane, freeway, intersection, medium residential, runway, agricultural, storage tank, parking lot, forest, sparse residential, chaparral, tennis courts, dense residential, baseball diamond, and mobile home park. The number of images per scene is evenly defined as 100, and only cities in the United States are covered in data acquisition. The size of each image is 256×256 pixels, and the spatial resolution is one foot. In [60], the authors focus on the task of recognizing multiple objects in an image and relabel the UC-Merced dataset, yielding a multi-label dataset. In this dataset, 2,100 images are relabeled, and each is assigned one or several labels from 17 newly defined object classes: airplane, sand, pavement, building, car, chaparral, court, tree, dock, tank, water, grass, mobile home, ship, bare soil, sea, and field.
• WHU20 [38]: The WHU20 dataset is an extended version of the WHU-RS dataset that was originally proposed in [61]. This dataset expands the numbers of aerial images and scene classes from 950 to 5,000 and from 12 to 20, respectively. For each scene category, more than 200 images with a size of 600 × 600 pixels are collected, and their spatial resolutions range from 0.26 m/pixel to 7.44 m/pixel.
• RSSCN7 [39]: The RSSCN7 dataset is a collection of 2,800 high-resolution images, each belonging to one of 7 scene categories: grassland, forest, farmland, parking lot, river/lake, industrial region, and residential region. 400 images with different spatial resolutions are cropped from Google Earth imagery for each scene, and the image size is 400 × 400 pixels.
• AID [27]: The AID dataset is a large-scale benchmark consisting of 10,000 aerial images and 30 scene types: airport, pond, forest, baseball field, resort, bare land, center, beach, bridge, commercial, desert, storage tanks, farmland, industrial, mountain, park, parking, playground, viaduct, church, railway station, river, school, meadow, sparse residential, dense residential, medium residential, square, stadium, and port. Google Earth is exploited to acquire image samples, and the spatial resolution of each sample varies from 0.5 m/pixel to 8 m/pixel. The size of images is 600 × 600 pixels, and the number of images for each class ranges from 220 to 420.
• NWPU-RESISC45 [40]: The NWPU-RESISC45 dataset contains 31,500 high-resolution images, and each is assigned one of 45 scene labels. For each scene, 700 images with a size of 256 × 256 pixels are acquired from Google Earth imagery, and their spatial resolutions vary from 0.2 m/pixel to 30 m/pixel.
In addition, we note that BigEarthNet [62] is a large-scale dataset for multi-label learning, where 590,326 Sentinel-2 images are captured over the European Union, and their spatial resolutions range from 10 m/pixel to 60 m/pixel. Since BigEarthNet focuses on land covers instead of scenes, we do not describe it further here. Table I presents an overview of public high-resolution aerial image datasets from the perspectives of dataset scale, image resolution, scene categories, and annotations.

B. MultiScene for Multi-scene Recognition
Although there are already various datasets for aerial scene recognition, most of them can only be used for single-scene classification. In this paper, we aim to take a step towards a more general scenario, multi-scene recognition in single images, and produce the MultiScene dataset.
To be more specific, we collect 100,000 high-resolution aerial images from Google Earth imagery, which cover six continents, Europe, Asia, North America, South America, Africa, and Oceania, and eleven countries including Germany, France, Italy, England, Spain, Poland, Japan, the United States, Brazil, South Africa, and Australia (cf. Fig. 3). This ensures high intra-class diversity, as different scene appearances resulting from different cultural regions are covered. The spatial resolution of each image ranges from 0.3 m/pixel to 0.6 m/pixel, and the spatial size of images is 512 × 512 pixels. In contrast to single-scene image datasets [27], [37]-[39], we put no constraints on the location and area of the dominant/trivial scenes in an image during the data collection process. Some example multi-scene images are exhibited in Fig. 4. In total, 36 scene categories are defined: apron, baseball field, basketball field, beach, bridge, cemetery, commercial, farmland, woodland, golf course, greenhouse, helipad, lake/pond, oil field, orchard, parking lot, park, pier, port, quarry, railway, residential, river, roundabout, runway, soccer field, solar farm, sparse shrub, stadium, storage tanks, tennis court, train station, wastewater plant, wind turbine, works, and sea.
To obtain crowdsourced annotations, we first localize each image in OSM with the coordinates of its four corners. Afterwards, we parse the properties of scenes present in the corresponding region and label images accordingly. In this way, crowdsourced annotations of all aerial images can be yielded automatically at a very low cost compared to conventional manual labeling. However, these almost free annotations might suffer from noise, as mentioned in Section I, and the performance of networks directly trained on them could be degraded. Therefore, we visually inspect 14,000 images from all six continents and correct their labels, yielding a subset, MultiScene-Clean. Fig. 3 shows the coordinate distribution of all images, and the number of samples associated with each scene is presented in Fig. 5. Compared to other scene recognition datasets (cf. Table I), our dataset is characterized by its multiple labels per image and the available crowdsourced annotations. Fig. 6 further shows the number of images associated with different numbers of scenes.
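As an illustration of this labeling procedure, the following is a minimal Python sketch that queries OSM through the public Overpass API for the tags inside an image footprint and maps them to scene labels. The Overpass endpoint is real, but the tag-to-scene mapping and the helper name `osm_scene_labels` are illustrative assumptions rather than the exact pipeline used to build the dataset.

```python
import requests

# Hypothetical mapping from OSM (key, value) tags to a few of the 36 scene categories.
TAG_TO_SCENE = {
    ("leisure", "golf_course"): "golf course",
    ("landuse", "cemetery"): "cemetery",
    ("landuse", "farmland"): "farmland",
    ("natural", "water"): "lake/pond",
    ("aeroway", "apron"): "apron",
}

def osm_scene_labels(south, west, north, east):
    """Query OSM via the Overpass API and map the tags found inside
    an image's bounding box to scene labels (illustrative sketch)."""
    bbox = f"{south},{west},{north},{east}"
    query = f"""
    [out:json][timeout:25];
    ( node({bbox}); way({bbox}); relation({bbox}); );
    out tags;
    """
    response = requests.post("https://overpass-api.de/api/interpreter",
                             data={"data": query})
    response.raise_for_status()
    labels = set()
    for element in response.json()["elements"]:
        for key, value in element.get("tags", {}).items():
            scene = TAG_TO_SCENE.get((key, value))
            if scene is not None:
                labels.add(scene)
    return sorted(labels)

# Example: labels for one image located by its corner coordinates.
print(osm_scene_labels(48.137, 11.575, 48.141, 11.580))
```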

C. Challenges
Compared to existing aerial scene datasets, our dataset brings more challenges to the field of scene interpretation from the following three perspectives:
• Images are unconstrained and large-scale, and thus scenes are likely to be incomplete and trivial, which makes recognition more difficult.
• The long-tail sample distribution (see Fig. 5) poses a challenge of learning unbiased models on an imbalanced dataset.
• We gather images from different cultural regions, which results in a high intra-class variation.
IV. EXPERIMENTS

A. Experimental Setup
Data Configuration. We evaluate the performance of existing models on both the MultiScene-Clean and MultiScene datasets. As to the former, we use 7,000 cleanly-labeled images to train and validate networks, and the remaining 7,000 images are utilized to test networks. For the latter, we leverage the same test set but train deep neural networks on the other 93,000 images with only crowdsourced annotations.
Evaluation. For a comprehensive evaluation, we measure the performance of baseline models with class-based, example-based, and overall metrics. Let L and N be the numbers of classes and examples, respectively; these metrics are calculated as follows.
• Class-based Metrics: Mean class-based precision (mCP), recall (mCR), F1 score (mCF1), and per-class average precision (AP) are calculated for measuring the performance of networks from the perspective of class. Specifically, mCP, mCR, and the mCF1 score are computed as:

$$\mathrm{mCP} = \frac{1}{L}\sum_{c=1}^{L}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FP}_c},\quad \mathrm{mCR} = \frac{1}{L}\sum_{c=1}^{L}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FN}_c},\quad \mathrm{mCF}_1 = \frac{2\,\mathrm{mCP}\cdot\mathrm{mCR}}{\mathrm{mCP}+\mathrm{mCR}},$$

where TP_c, FP_c, and FN_c denote the numbers of true positives, false positives, and false negatives for the c-th class, respectively.
The AP of the c-th class is calculated with the following formula:

$$\mathrm{AP}_c = \frac{1}{N_c}\sum_{k=1}^{N}\frac{\mathrm{TP}_c@k}{\mathrm{TP}_c@k+\mathrm{FP}_c@k}\cdot \mathrm{rel}@k,$$

where N_c denotes the number of examples including the c-th class, and TP_c@k and FP_c@k represent the numbers of true and false positives in the top-k examples, respectively. Notably, TP_c@k and FP_c@k are equivalent to TP_c and FP_c when k equals N. rel@k denotes the relevance between the k-th example and the c-th class, and it is set to 1/0 when the c-th class is included/excluded. Besides, the mean average precision (mAP) can be computed by averaging APs over all categories.
• Example-based Metrics: Mean example-based precision (mEP), recall (mER), and F1 score (mEF1) are computed to validate networks from the perspective of example with the following equations:

$$\mathrm{mEP} = \frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_k}{\mathrm{TP}_k+\mathrm{FP}_k},\quad \mathrm{mER} = \frac{1}{N}\sum_{k=1}^{N}\frac{\mathrm{TP}_k}{\mathrm{TP}_k+\mathrm{FN}_k},\quad \mathrm{mEF}_1 = \frac{2\,\mathrm{mEP}\cdot\mathrm{mER}}{\mathrm{mEP}+\mathrm{mER}},$$

where TP_k, FP_k, and FN_k denote the numbers of true positives, false positives, and false negatives in the k-th example.
• Overall Metrics: Overall precision (OP), recall (OR), and F1 score (OF1) measure the performance of models from a more holistic perspective, and they are calculated as:

$$\mathrm{OP} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\quad \mathrm{OR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\quad \mathrm{OF}_1 = \frac{2\,\mathrm{OP}\cdot\mathrm{OR}}{\mathrm{OP}+\mathrm{OR}},$$

where TP, FP, and FN are counted based on predictions of all scenes and examples. A minimal implementation sketch of these metrics is given below.
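The following is a minimal NumPy sketch of the class-based, example-based, and overall metrics defined above, assuming predictions and ground truths are given as N × L binary matrices; the function names are ours, and per-class AP is omitted for brevity.

```python
import numpy as np

def class_based_metrics(y_true, y_pred, eps=1e-12):
    """mCP, mCR, mCF1 from N x L binary label/prediction matrices."""
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)  # per-class TP_c
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)  # per-class FP_c
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0)  # per-class FN_c
    mcp = np.mean(tp / (tp + fp + eps))
    mcr = np.mean(tp / (tp + fn + eps))
    return mcp, mcr, 2 * mcp * mcr / (mcp + mcr + eps)

def example_based_metrics(y_true, y_pred, eps=1e-12):
    """mEP, mER, mEF1, averaging over examples instead of classes."""
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=1)  # per-example TP_k
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=1)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=1)
    mep = np.mean(tp / (tp + fp + eps))
    mer = np.mean(tp / (tp + fn + eps))
    return mep, mer, 2 * mep * mer / (mep + mer + eps)

def overall_metrics(y_true, y_pred, eps=1e-12):
    """OP, OR, OF1 counted over all scenes and examples at once."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    op, orec = tp / (tp + fp + eps), tp / (tp + fn + eps)
    return op, orec, 2 * op * orec / (op + orec + eps)
```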

B. Baselines
To provide comprehensive benchmarks, we evaluate the performance of an extensive set of popular deep neural networks. Since they were originally designed for single-label classification, we substitute sigmoid functions for their softmax activations to predict multiple scene labels, which are encoded into multi-hot binary sequences. Besides, several classical machine learning algorithms are also evaluated. In total, 22 models are tested on both the MultiScene-Clean and MultiScene datasets, and a brief review follows the sketch below.
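The following minimal PyTorch sketch illustrates this multi-label adaptation on an ImageNet-pretrained backbone; the choice of ResNet-50, the dummy batch, and the 0.5 decision threshold are illustrative assumptions rather than our exact training code.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SCENES = 36  # scene categories in MultiScene

# Start from an ImageNet-pretrained backbone and replace its
# single-label classification head with a 36-way linear layer.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_SCENES)

# BCEWithLogitsLoss fuses the sigmoid activation with binary
# cross-entropy, i.e., one independent binary decision per scene.
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(2, 3, 512, 512)   # a dummy mini-batch
targets = torch.zeros(2, NUM_SCENES)   # multi-hot label sequences
targets[0, [7, 15]] = 1.0              # e.g., two scenes present in image 0
loss = criterion(model(images), targets)

# At inference time, apply a sigmoid and threshold each score at 0.5.
probs = torch.sigmoid(model(images))
predicted = (probs > 0.5).float()
```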
• SVM [63]: Support vector machine (SVM) aims to learn one or several hyperplanes for separating samples of different classes with the largest margin. Usually, the hyperplanes are constructed in a high-dimensional space and can be learned directly (Linear SVM) or through kernel functions (Nonlinear SVM). In our experiments, we select the latter and use a radial basis function (RBF) kernel [64] to learn SVM.
• RF [65]: Random forest (RF) is an ensemble of decision trees, which are trained with random subspaces of image features and make final predictions through majority voting. The number of decision trees is set to 200 in our experiments.
• XGBoost: XGBoost is a scalable implementation of gradient tree boosting, in which decision trees are added sequentially so that each new tree corrects the residual errors of the current ensemble (see https://xgboost.readthedocs.io/en/latest/tutorials/model.html).
• Inception-v3 [71]: Inception networks stack inception modules, in which convolutions with different kernel sizes are conducted in parallel to capture multi-scale features. A bottleneck architecture made of 1×1 convolutions is introduced to mitigate the boosted computational cost resulting from heavy inception modules. In Tables II and III, we report the performance of Inception-v3 in multi-scene recognition.
• ResNet [73]: ResNet aims to address the degradation problem by learning residual mappings with shortcut connections. By doing so, ResNet can go much deeper than plain CNNs and achieve outstanding performance in not only image classification but also semantic segmentation and object detection tasks. In our experiments, we evaluate a 50-layer ResNet (ResNet-50), a 101-layer ResNet (ResNet-101), and a 152-layer ResNet (ResNet-152) on the proposed dataset. Notably, residual blocks in these deep ResNets are modified into bottleneck architectures for reducing the computational burden.
• SqueezeNet [74]: SqueezeNet focuses on preserving network performance with fewer parameters. To achieve this, most of the 3 × 3 convolutional filters are replaced with 1 × 1 filters, and features are squeezed in the channel dimension before being fed into the remaining 3 × 3 filters.
• MobileNet [75]: MobileNet factorizes a standard convolution into a depthwise convolution and a pointwise convolution: the former is conducted on each channel, and the latter aggregates channel-wise outputs via 1 × 1 convolutions. To further reduce the computational cost, two hyperparameters, a width multiplier α and a resolution multiplier β, are designed to shrink feature channels and input resolutions, respectively. In the advanced variation of MobileNet, i.e., MobileNet-V2 [76], inverted residual connections and linear bottlenecks are developed to improve the network performance. In our experiments, we train MobileNet-V2 and set both α and β to the default value, 1.
• ShuffleNet [77]: ShuffleNet improves computational efficiency by utilizing pointwise group convolutions and channel shuffle. Specifically, the former divides feature maps into several groups and conducts 1×1 convolutions on each group independently. The latter rearranges feature channels to enable information to flow across channels belonging to different groups (a minimal sketch of this operation is given after this list). Besides, element-wise addition, which is often used in a residual block, is replaced with concatenation for enlarging the channel dimension at a low computational cost. In ShuffleNet-V2 [78], features are grouped by channel split, and pointwise group convolutions are discarded. As a consequence, two feature groups are yielded and fed into two branches, of which one is an identity mapping and the other is a set of convolutions. Afterwards, the outputs are concatenated and shuffled along the channel dimension. In our experiments, we evaluate the performance of ShuffleNet-V2 on our dataset.
• DenseNet [79]: DenseNet proposes to enhance information flow by directly connecting each layer to all subsequent layers with equivalent feature-map sizes. To preserve information learned by preceding layers, concatenation is employed to combine features from various layers.
• MnasNet: MnasNet is obtained via an automated neural architecture search whose objective explicitly accounts for real-world inference latency. As a consequence, a MnasNet searched on target datasets is expected to achieve a good trade-off between accuracy and latency. To control the model size, a depth multiplier is designed for scaling the number of channels in each layer. In our experiments, the depth multiplier is set to 1, and the best-performing MnasNet searched on the ImageNet dataset [84] is chosen to perform multi-scene recognition in the wild.
• KFBNet [51]: KFBNet exploits a key region capturing method, namely a key filter bank (KFB), for aerial image scene classification. The proposed KFB is composed of two streams: a global stream (G-Stream) and a key stream (K-Stream). The former predicts labels using features learned by the last block of a CNN, while the latter highlights key features in both spatial and channel dimensions for inferring scene categories. Finally, predictions made by the two streams are merged via an element-wise addition as the final decision. We take VGG-16 as the backbone and report numerical results in Tables II, III, and V.
• FACNN [52]: FACNN is a scene classification network composed of a CNN backbone and a feature aggregation module. In the latter, features extracted by the last three blocks of VGG-16 are aggregated through pooling operations and 1 × 1 convolutions. Afterwards, they are concatenated with the outputs of the second fully-connected layer of VGG-16 to form discriminative scene representations for the final prediction.
• SAFF [53]: SAFF proposes a non-parametric self-attention layer for enhancing spatial and channel responses of feature maps. Specifically, features extracted by the last three blocks of a pre-trained CNN (e.g., VGG-16) are fused and fed into the proposed self-attention layer. In this layer, spatial- and channel-wise weightings are conducted to emphasize the importance of locations of salient objects and of channels with infrequently occurring features, respectively. Principal component analysis (PCA) whitening is also introduced to reduce information redundancy and squash channels. However, since this operation frequently fails the network training, we replace it with a learnable fully-connected layer. Besides, VGG-16 is selected as the backbone in our experiments.
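As an illustration of the channel shuffle operation described in the ShuffleNet entry above, the following is a minimal PyTorch sketch; the function name is ours, and the toy tensor only demonstrates the reordering.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Rearrange channels so information flows across groups:
    reshape to (N, g, C/g, H, W), swap the two group axes, flatten back."""
    n, c, h, w = x.size()
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)         # channels 0..7 in order
print(channel_shuffle(x, groups=2).flatten())  # -> 0, 4, 1, 5, 2, 6, 3, 7
```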

C. Training Details
Before training SVM, RF, and XGBoost, we extract histogram of oriented gradients (HOG) [85] and local binary pattern (LBP) [86] features as recommended in [87]. The size of each cell is set to 32×32 pixels for HOG, and the radius is defined as 16 pixels for LBP. We use scikit-learn to implement these machine learning classifiers and apply them to multi-scene recognition using the function MultiOutputClassifier. As to baseline classification neural networks, we initialize them with weights pre-trained on the ImageNet dataset and fine-tune them on the proposed multi-scene image dataset. The loss is defined as binary cross-entropy, and stochastic gradient descent (SGD) with momentum [88] is selected as the optimizer. To accelerate the network convergence, the momentum is set to a large value, 0.9. Besides, the initial learning rate and weight decay are set to 0.02 and 1e-4, respectively. All deep networks are implemented in PyTorch and validated on one NVIDIA Tesla V100-SXM2 32GB GPU. For experiments on both MultiScene-Clean and MultiScene, we train networks for 87k and 581k iterations, respectively, and the size of each training batch is set to 16 for both versions.
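To make this classical pipeline concrete, below is a minimal sketch of HOG/LBP feature extraction combined with MultiOutputClassifier, assuming grayscale inputs and dummy data; the exact feature dimensions and SVM hyperparameters used in our experiments may differ.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

def extract_features(image):
    """Concatenate HOG (32x32-pixel cells) and a uniform-LBP histogram (radius 16)."""
    hog_feat = hog(image, pixels_per_cell=(32, 32), cells_per_block=(2, 2))
    lbp = local_binary_pattern(image, P=8, R=16, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([hog_feat, lbp_hist])

# X: stacked feature vectors, Y: N x 36 multi-hot label matrix (dummy data here).
rng = np.random.default_rng(0)
images = rng.random((8, 512, 512))                 # dummy grayscale images
X = np.stack([extract_features(im) for im in images])
Y = rng.integers(0, 2, size=(8, 36))
Y[0, :] = 0
Y[1, :] = 1                                        # ensure both classes per column

# MultiOutputClassifier fits one RBF-kernel SVM per scene category.
clf = MultiOutputClassifier(SVC(kernel="rbf")).fit(X, Y)
predictions = clf.predict(X)                       # N x 36 binary matrix
```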
D. Experimental Results across Different Tasks
1) Multi-scene Recognition with Cleanly-labeled Data: To evaluate baselines for our task, we conduct experiments on the MultiScene-Clean dataset and report quantitative results in Table II. It can be seen that ResNeXt-101 achieves the best mAP (64.8%), mEF1 (70.2%), and OF1 score (71.3%), which demonstrates its high performance and robustness in this task from almost all perspectives. LR-ResNet-50 gains the highest value in mCF1 (59.0%) owing to its capability of reasoning about relations among various scenes. Moreover, such a reasoning capability also enables LR-ResNet-50 to surpass the other baselines in all recall metrics, as scenes tend to be predicted as positive once their related scenes are recognized. Another observation is that MnasNet, SqueezeNet, and ShuffleNet-V2 show relatively poor performance due to their light-weight designs. Compared to deep neural networks, traditional machine learning algorithms achieve lower scores in all metrics.
For an insight into the performance of networks in identifying different scenes, we also report per-class APs in Table III. As we can see, ResNeXt-101 achieves the highest APs in most scenes, which is in line with the previous observations. Furthermore, we note that most networks fail to accurately recognize scenes having scarce training samples, e.g., oil field and port. This suggests that learning unbiased models on an imbalanced dataset is a big challenge. Besides numerical results, we exhibit several predictions in Table IV.
2) Learning from Noisy Crowdsourced Labels: We investigate networks learned from noisy crowdsourced labels for our task on the MultiScene dataset. To ensure a fair comparison, we utilize the same test set as in Section IV-D1 and report numerical results in Table V. It can be observed that the OF1 scores of all models decrease by an average of 8.2% compared to the values in Table II, which demonstrates that noise in crowdsourced annotations significantly affects the learning of deep neural networks. Moreover, it is interesting to note that the values of the class-based metrics, mAP and mCF1 score, increase by 4.6% and 1.2%, respectively, in comparison with those in Table II. This can be attributed to the fact that the numbers of training samples, especially for rarely appearing scenes, are effortlessly increased by crawling OSM data with keyword searching. Compared to models showing high performance on the MultiScene-Clean dataset, we find that DenseNet gains the highest scores in mCF1 (59.3%), mEF1 (62.6%), and OF1 (64.1%), as it can sufficiently reuse features and has relatively few parameters. Besides, LR-VGG-16 achieves the highest mAP (67.8%), which demonstrates that taking advantage of underlying relations among various scenes can suppress the influence of noise introduced by OSM data. Furthermore, we compare the performance of several networks trained on the MultiScene-Clean and MultiScene datasets in Fig. 7, and it can again be observed that higher class-based scores (see orange and brown bars in Fig. 7) are obtained when using massive crowdsourced labels. All in all, although noisy crowdsourced labels degrade the overall performance of networks, the comparisons in class-based scores also suggest their great potential.

V. CONCLUSION
In this paper, we propose a large-scale dataset, MultiScene, for multi-scene recognition in single images, which is featured by unconstrained multi-scene aerial images and the availability of both crowdsourced and clean labels. The proposed dataset enables research on not only recognizing aerial scenes in the wild but also learning from noisy crowdsourced labels. We comprehensively evaluate popular baseline models on both MultiScene-Clean (a subset consisting of only cleanly-labeled images) and MultiScene. Experimental results on the former demonstrate that unconstrained multi-scene recognition is still a challenging task, and those on the latter showcase the great potential of exploiting a large number of crowdsourced annotations. Looking into the future, the dataset can be applied to develop more efficient networks and learning strategies for exploiting noisy labels for aerial scene understanding in the wild.

Fig. 1. Examples of images utilized in (a) single-scene and (b) multi-scene recognition tasks. In (a), each aerial image is assigned one scene label, while in (b), labels of all present scenes are inferred. In comparison with (b), (a) might suffer from partial scene understanding, as only one label is predicted even if there indeed exist multiple scenes in an image. For a clear visualization, locations of scenes are marked in (b).

Fig. 2. Examples of (a) incomplete and (b) incorrect OSM annotations. In (a), sparse shrubs are not annotated in OSM data, while in (b), the tennis court is mislabeled as residential.

Fig. 3. Coordinate distributions and examples of multi-scene aerial images in our dataset. Red dots denote images with both crowdsourced and clean labels, and cyan dots represent images with only crowdsourced scene labels.

Fig. 4. Example multi-scene aerial images with their crowdsourced and clean annotations in the MultiScene dataset.

Fig. 5. Sample distributions of all scene categories in our dataset. Each cyan bar indicates the number of images assigned only OSM labels with respect to each scene category, and red bars represent the numbers of images with both OSM and clean labels.

Fig. 6. The number of images associated with different numbers of scenes. The Y-axis indicates the number of scenes, and the X-axis represents the number of images. The legend is the same as that in Fig. 5.

Fig. 7. Comparisons of the performance of networks trained on images with clean (light-color bars) and crowdsourced (dark-color bars) annotations, respectively. For each network, the left four bars represent class-based scores, mAP and mCF1, while the right four bars indicate mEF1 and OF1 scores.

TABLE I
COMPARISON WITH EXISTING AERIAL SCENE DATASETS FROM VARIOUS PERSPECTIVES.

TABLE II
NUMERICAL RESULTS OF BASELINE MODELS ON THE MULTISCENE-CLEAN DATASET (%). MODELS ARE TRAINED AND TESTED ON CLEANLY-LABELED IMAGES, AND THE BEST SCORES ARE SHOWN IN BOLD.

TABLE IV
EXAMPLE PREDICTIONS OF RESNEXT-101 ON THE MULTISCENE-CLEAN DATASET.

TABLE V
NUMERICAL RESULTS OF BASELINE MODELS ON THE MULTISCENE DATASET (%). MODELS ARE TRAINED ON IMAGES WITH NOISY CROWDSOURCED ANNOTATIONS AND TESTED ON CLEANLY-LABELED IMAGES. THE BEST SCORES ARE SHOWN IN BOLD.