Relation Network for Multilabel Aerial Image Classification

Multilabel classification plays a momentous role in perceiving the intricate contents of an aerial image and has triggered several related studies over the last years. However, most of them devote little effort to exploiting label relations, although such dependencies are crucial for making accurate predictions. Although a long short-term memory (LSTM) layer can be introduced to model such label dependencies in a chain propagation manner, its effectiveness might be questioned when certain labels are improperly inferred. To address this, we propose a novel aerial image multilabel classification network, the attention-aware label relational reasoning network. Particularly, our network consists of three elemental modules: 1) a label-wise feature parcel learning module; 2) an attentional region extraction module; and 3) a label relational inference module. To be more specific, the label-wise feature parcel learning module is designed for extracting high-level label-specific features. The attentional region extraction module aims at localizing discriminative regions in these features without region proposal generation, yielding attentional label-specific features. The label relational inference module finally predicts label existences using label relations reasoned from the outputs of the previous module. The proposed network is characterized by its capacities of extracting discriminative label-wise features and reasoning about label relations naturally and interpretably. In our experiments, we evaluate the proposed model on two multilabel aerial image data sets, one of which is newly produced. Quantitative and qualitative results on these two data sets demonstrate the effectiveness of our model. To facilitate progress in multilabel aerial image classification, our produced data set will be made publicly available.


I. INTRODUCTION
Recent advancements of remote sensing techniques have boosted the volume of attainable high-resolution aerial images, and a large number of applications, such as urban cartography [1], [2], [3], [4], traffic monitoring [5], [6], [7], terrain surface analysis [8], [9], [10], [11], and ecological scrutiny [12], [13], have benefited from these developments. For this reason, aerial image classification has become one of the fundamental visual tasks in the remote sensing community and has drawn a plethora of research interest [14], [15], [16], [17], [18], [19], [20], [21]. The classification of aerial images refers to assigning these images specific labels according to their semantic contents, and a common hypothesis shared by many relevant studies is that an image should be labeled with only one semantic category, such as a scene category (see Fig. 1). Although such image-level labels [22], [23] are capable of delineating images from a macroscopic perspective, they cannot provide a comprehensive view of the objects in aerial images. To tackle this, numerous algorithms have been proposed to identify each pixel in an image [24], [25], [26] or to localize objects with bounding boxes [27], [28], [29]. However, the acquisition of the requisite ground truths (i.e., pixel-wise annotations and bounding boxes) demands enormous expertise and human labor, which makes relevant datasets expensive and difficult to access. Against this background, multi-label image classification now attracts increasing attention in the remote sensing community [30], [31], [32], [33], [34] because 1) a comprehensive picture of aerial image contents can be drawn, and 2) the datasets required for this task are not expensive (only image-level labels are needed). Fig. 1 illustrates the difference between image-level scene labels and object labels. As shown in this figure, although these four images are assigned the same scene label, their multiple object labels vary a lot. It is worth noting that the identification of some objects can actually offer important cues for understanding a scene more deeply. For example, the existence of building and pavement indicates a high probability that the rivers in Fig. 1c and 1d are very close to areas with frequent human activities, while the rivers in Fig. 1a and 1b are more likely in the wild due to the absence of human activity cues. In contrast, simply recognizing scene labels can hardly provide such information. Therefore, in this paper, we dedicate our efforts to exploring an effective model for the multi-label classification of aerial images.

This work is jointly supported by the China Scholarship Council, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. [ERC-2016-StG-714087], Acronym: So2Sat), and the Helmholtz Association under the framework of the Young Investigators Group "SiPEO" (VH-NG-1018, www.sipeo.bgu.tum.de). Besides, the authors would like to thank Xinyi Liu for supporting this work with data annotation. (Corresponding author: Xiao Xiang Zhu.) Y. Hua, L. Mou, and X. X. Zhu are with the Remote Sensing Technology Institute, German Aerospace Center, 82234 Weßling, Germany, and also with Signal Processing in Earth Observation, Technical University of Munich, 80333 Munich, Germany (e-mail: yuansheng.hua@dlr.de; lichao.mou@dlr.de; xiaoxiang.zhu@dlr.de).

A. Challenges of Identifying Multiple Labels
In identifying multiple labels of an aerial image, two main challenges must be faced. One is how to extract semantic feature representations from raw images. This is crucial but difficult, especially for high-resolution aerial images, as they often contain complicated spatial contextual information. Conventional approaches mainly resort to manually crafted features and semantic models [22], [35], [36], [37], [38], but these methods cannot effectively extract high-level semantics and lead to limited classification performance [23]. Hence an efficient high-level feature extractor is desirable.
The other challenge is how to take full advantage of label correlations to infer multiple object labels of an aerial image. In contrast to single-label classification, which mainly focuses on modeling image-label relevance, exploring and modeling label-label correlations plays a supplementary yet essential role in identifying multiple objects in aerial images. For instance, the presence of a ship strongly implies the co-occurrence of water or sea, while the existence of a car suggests a high probability of the appearance of pavement. Unfortunately, such label correlations are scarcely addressed in the literature. One solution is to use a recurrent neural network (RNN) to learn label dependencies. However, this is done in a chain propagation fashion, and its performance heavily depends on how effectively its long-term memorization is learned. Moreover, in this way, label relations are modeled implicitly, which leads to a lack of interpretability.
Overall, an efficient multi-label classification model is supposed to be capable of not only learning high-level feature representations but also modeling label correlations effectively.

B. Related Work
Zegeye and Demir [39] propose a multi-label active learning framework using a multi-label support vector machine (SVM), relying on both multi-label uncertainty and diversity. Koda et al. [32] introduce a spatial and structured SVM for multi-label classification by considering spatial relations between a given patch and its neighbors. Similarly, Zeggada et al. [33] employ a conditional random field (CRF) framework to model spatial contextual information among adjacent patches for improving the performance of classifying multiple object labels.
With the development of computational resources and deep learning, very recent approaches mainly resort to deep networks for multi-label classification. In [31], the authors make use of a standard CNN architecture to extract feature representations and then feed them into a multi-label classification layer, which is composed of customized thresholding operations, for predicting multiple labels. In [40], the authors demonstrate that training a CNN for multi-label classification with a limited amount of labeled data usually leads to an underperforming model and propose a dynamic data augmentation method for enlarging training sets. More recently, Sumbul and Demir [41] propose a CNN-RNN method for identifying labels in multi-spectral images, where a bidirectional LSTM is employed to model spatial relationships among image patches. In order to explore inherent correlations among object labels, [34] proposes a CNN-LSTM hybrid network architecture to learn label dependencies for classifying object labels of aerial images.

C. The Motivation of Our Work
In order to explicitly model label relations, we propose a label relational inference network for multi-label aerial image classification. This work is inspired by recent successes of relation networks in visual question answering [42], object detection [43], video classification [44], activity recognition in videos [45], and semantic segmentation [46]. A relation network is characterized by its inherent capability of inferring relations between an individual entity (e.g., a region in an image or a frame in a video) and all other entities (e.g., all regions in the image or all frames in the video). Besides, to increase the effectiveness of relational reasoning, we make use of a spatial transformer, which is often used to enhance the transformation invariance of deep neural networks [47], to reduce the impact of irrelevant semantic features.
More specifically, in this work, an innovative end-to-end multi-label aerial image classification network, termed the attention-aware label relational reasoning network, is proposed and characterized by its capabilities of localizing label-specific discriminative regions and explicitly modeling semantic label dependencies for the task. The contributions of this paper are threefold.
• We propose a novel multi-label aerial image classification network, the attention-aware label relational reasoning network, which consists of three imperative components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. To the best of our knowledge, it is the first time that the idea of relation networks is employed to predict multiple object labels of aerial images, and experimental results demonstrate its effectiveness.
• We extract attentional regions from the label-wise feature parcels in a proposal-free fashion. Particularly, a learnable spatial transformer is employed to localize attentional regions, which are assumed to contain discriminative information, and then re-coordinate them into a given size. By doing so, attentional feature parcels can be yielded.
• To facilitate progress in multi-label aerial image classification, we produce a new dataset, the AID multi-label dataset, by relabeling images in the AID dataset [23]. In comparison with the UCM multi-label dataset [48], the proposed dataset is more challenging due to diverse spatial resolutions of images, more scenes, and more samples.
The remaining sections of this paper are organized as follows. Section II delineates the three elemental modules of our proposed network, and Section III introduces the experiments, where experimental setups are given and results are analyzed and discussed. Finally, Section IV concludes this paper.

II. METHODOLOGY

A. Network Architecture
As illustrated in Fig. 2, the proposed network comprises three components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. Let $L$ be the number of object labels and let $l \in \{1, \dots, L\}$ index the $l$-th label. The label-wise feature parcel learning module is designed to extract high-level feature maps $X_l$ with $K$ channels, termed a feature parcel (for more details refer to Section II-B), for each object $l$. The attentional region extraction module is used to localize discriminative regions in each $X_l$ and generate an attentional feature parcel $A_l$, which is supposed to contain the most relevant semantics with respect to the label $l$. Finally, relations between $A_l$ and all other label-wise attentional feature parcels are reasoned about by the label relational inference module for predicting the presence of the object $l$.
Details of the proposed network are introduced in the remaining sections.

B. Label-wise Feature Parcel Learning
The extraction of high-level features is crucial for visual recognition tasks, and many recent studies adopt CNNs owing to their remarkable performance in learning such features [15], [49], [50], [51], [52]. Hence, we take a standard CNN as the backbone of the label-wise feature parcel learning module in our model. As shown in Fig. 2, an aerial image is first fed into a CNN (e.g., VGG-16), which consists of only convolutional and max-pooling layers, for generating high-level feature maps. Afterwards, these features are encoded into $L$ feature parcels via a label-wise multi-modality feature learning layer, where $KL$ convolutional filters with a size of 1 × 1 are employed. The channel dimensionality of the output features is $KL$, while the spatial dimensionality is unchanged. With this design, $K$ feature maps are learned for each object $l$; they are called a feature parcel and denoted as $X_l$. After iterative training, these feature parcels are expected to contain discriminative label-related semantic information.
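The label-wise feature learning layer described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `label_wise_parcels`, the value K = 4, the random weights, and the choice to assign consecutive groups of K channels to each label are all assumptions made for the example, and no activation function is applied because the text does not specify one.

```python
import numpy as np

def label_wise_parcels(features, weights, bias, num_labels):
    """Apply a 1x1 convolution with K*L filters and split the result into L parcels.

    features: (H, W, C) backbone feature maps.
    weights:  (C, K*L) kernel of the 1x1 convolution; bias: (K*L,).
    Returns an array of shape (L, H, W, K), one K-channel parcel per label.
    """
    h, w, _ = features.shape
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    out = features @ weights + bias              # (H, W, K*L)
    k = out.shape[-1] // num_labels
    # Assumption: consecutive groups of K channels form the parcel of one label.
    parcels = out.reshape(h, w, num_labels, k)   # (H, W, L, K)
    return np.transpose(parcels, (2, 0, 1, 3))   # (L, H, W, K)

rng = np.random.default_rng(0)
H, W, C, K, L = 14, 14, 512, 4, 17   # K = 4 is illustrative only
feats = rng.standard_normal((H, W, C))
kernel = rng.standard_normal((C, K * L)) * 0.01
parcels = label_wise_parcels(feats, kernel, np.zeros(K * L), L)
print(parcels.shape)  # (17, 14, 14, 4)
```

The spatial size of the output equals that of the input, and only the channel axis is regrouped, which matches the statement that the spatial dimensionality is unchanged.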
In our experiments, we notice that $X_l$ with a higher resolution is beneficial for the subsequent module to localize discriminative regions, as more spatial contextual cues are included. Accordingly, we discard the last max-pooling layer in VGG-16, leading to a spatial size of 14 × 14 for the outputs. Moreover, we extend our study to GoogLeNet (Inception-v3) [53] and ResNet (ResNet-50 in our case) [54] for a comprehensive evaluation. Specifically, we adapt GoogLeNet by removing the global average pooling and fully-connected layers as well as reducing the stride of the convolutional and pooling layers in "mixed8" to 1 to improve the spatial resolution. Besides, in order to preserve the receptive fields of subsequent convolutional layers, filters in "mixed9" are replaced with atrous convolutional filters with a dilation rate of 2. Regarding ResNet, we set the convolution stride and dilation rate of filters to 1 and 2, respectively, in the last residual block. The global average pooling and fully-connected layers are removed as well.

C. Attentional Region Extraction Module
Although label-wise feature parcels can be directly applied to exploring label dependencies [34], less informative regions (see blue areas in Fig. 3) may bring noise and further reduce the effectiveness of these feature parcels. As shown in the left image of Fig. 3, weakly activated regions indicate a loose relevance to the corresponding object label, while highlighted regions suggest a strong region-label relevance. To diminish the influence of irrelevant regions, we employ an attentional region extraction module to automatically extract discriminative regions from label-wise feature parcels.
We localize and re-coordinate attentional regions from $X_l$ with a learnable spatial transformer. Particularly, we sample a feature parcel $X_l$ into a regular spatial grid $G_{X_l}$ (cf. green dots in the left image of Fig. 3) according to the spatial resolution of $X_l$ and regard pixels in $X_l$ as points on the grid $G_{X_l}$ with coordinates $(x_l, y_l)$. Similarly, we can define coordinates of a new grid, the attentional region grid $G_{X_l^{attn}}$ (see white dots in the middle image of Fig. 3), as $(x_l^{attn}, y_l^{attn})$, and the number of grid points along the height and width is equal to that of $G_{X_l}$. As demonstrated in [47], $G_{X_l^{attn}}$ can be learned by performing a spatial transformation on $G_{X_l}$, so $(x_l^{attn}, y_l^{attn})$ can be calculated with the following equation:

$$\begin{bmatrix} x_l^{attn} \\ y_l^{attn} \end{bmatrix} = M_{T_l} \begin{bmatrix} x_l \\ y_l \\ 1 \end{bmatrix}, \tag{1}$$

where $M_{T_l}$ is a learnable transformation matrix, and the grid coordinates $x_l$ and $y_l$ are normalized to $[-1, 1]$. Considering that this module is designed for localization, we only adopt scaling and translation in our case. Hence Eq. 1 can be rewritten as

$$\begin{bmatrix} x_l^{attn} \\ y_l^{attn} \end{bmatrix} = \begin{bmatrix} s_{x_l} & 0 & t_{x_l} \\ 0 & s_{y_l} & t_{y_l} \end{bmatrix} \begin{bmatrix} x_l \\ y_l \\ 1 \end{bmatrix}, \tag{2}$$

where $s_{x_l}$ and $s_{y_l}$ indicate scaling factors along the x- and y-axis, respectively, and $t_{x_l}$ and $t_{y_l}$ represent how feature maps should be translated along both axes. Notably, since different objects are distributed differently in aerial images, $M_{T_l}$ is learned for each object $l$ individually. In other words, extracted attentional regions are label-specific and capable of improving the effectiveness of label-wise features.
As to the implementation of this module, we first vectorize $X_l$ with a flatten function and then employ a localization layer (e.g., a fully connected layer) to estimate the elements of $M_{T_l}$ from the vectorized $X_l$. Afterwards, the attentional region grid coordinates $(x_l^{attn}, y_l^{attn})$ can be computed from $(x_l, y_l)$ with Eq. 2, and the values of pixels at $(x_l^{attn}, y_l^{attn})$ can be obtained from neighboring pixels by bilinear interpolation. Finally, the attentional region grid $G_{X_l^{attn}}$ is re-coordinated to a regular spatial grid, which shares an identical structure with $G_{X_l}$, yielding the final attentional feature parcel $A_l$.
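The grid transformation with scaling and translation followed by bilinear interpolation can be sketched as below. This is a simplified NumPy illustration under stated assumptions: the function name `attentional_parcel` is hypothetical, the scaling and translation parameters are passed in directly instead of being estimated by a localization layer, and sampling positions that fall outside the feature parcel are clamped to the border, which the text does not specify.

```python
import numpy as np

def attentional_parcel(x, sx, sy, tx, ty):
    """Resample a feature parcel x of shape (H, W, K) on the scaled and translated grid."""
    h, w, _ = x.shape
    # Regular grid with coordinates normalized to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    # Scaling and translation only (no rotation or shear).
    xa = sx * xs + tx
    ya = sy * ys + ty
    # Map normalized coordinates back to fractional pixel indices.
    px = np.clip((xa + 1) * (w - 1) / 2, 0, w - 1)   # assumption: clamp to border
    py = np.clip((ya + 1) * (h - 1) / 2, 0, h - 1)
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (px - x0)[..., None]
    wy = (py - y0)[..., None]
    # Bilinear interpolation from the four neighboring pixels.
    top = (1 - wx) * x[y0, x0] + wx * x[y0, x1]
    bottom = (1 - wx) * x[y1, x0] + wx * x[y1, x1]
    return (1 - wy) * top + wy * bottom              # same (H, W, K) shape as x

# A horizontal ramp makes the effect visible: zooming by 0.5 keeps the central half.
ramp = np.tile(np.arange(5.0)[None, :, None], (5, 1, 1))
zoom = attentional_parcel(ramp, 0.5, 0.5, 0.0, 0.0)
print(zoom[0, :, 0])
```

With scaling factors of 1 and translations of 0 the function returns its input, which corresponds to the identity transformation; scaling factors below 1 restrict sampling to a sub-region that is stretched back to the original grid size.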

D. Label Relational Inference Module
Being the core of our model, the label relational inference module is designed to fully exploit label interrelations for inferring the existence of all labels. Before diving into this module, we define the pairwise label relation as a composite function with the following equation:

$$LR(A_l, A_m) = f_\phi\big(g_{\theta_{lm}}(A_l, A_m)\big), \tag{3}$$

where the input is a pair of attentional feature parcels, $A_l$ and $A_m$, and $l$ and $m$ range from 1 to $L$. The functions $g_{\theta_{lm}}$ and $f_\phi$ are used to reason about the pairwise relation between labels $l$ and $m$. More specifically, the role of $g_{\theta_{lm}}$ is to reason about whether there exist relations between the two objects and how they are related. In previous works [42], [45], a multilayer perceptron (MLP) is commonly employed as $g_{\theta_{lm}}$ for its simplicity. However, spatial contextual semantics are not taken into account in this way. To address this issue, we make use of a 1 × 1 convolution instead of an MLP to explore spatial information. Furthermore, $f_\phi$ is applied to encode the output of $g_{\theta_{lm}}$ into the final pairwise label relation $LR(A_l, A_m)$. In our case, $f_\phi$ consists of a global average pooling layer and an MLP, which finally yields the relation between labels $l$ and $m$. Following the motivation of our work, we infer each label by accumulating all related pairwise label relations, and the accumulated label relation for object $l$ is defined as:

$$LR(A_l, *) = f_\phi\Big(\sum_{m \neq l} g_{\theta_{lm}}(A_l, A_m)\Big), \tag{4}$$

where $*$ represents all attentional feature parcels except $A_l$.
Based on this formula, we implement the label relational inference module with the following steps (taking the prediction of label $l$ as an example):
1) $A_l$ is concatenated with every other attentional feature parcel, and each concatenated pair is fed into its own 1 × 1 convolutional layer.
2) Afterwards, a global average pooling layer is employed to transform each $g_{\theta_{lm}}(A_l, A_m)$ into a vector, and these vectors are added element-wise.
3) Finally, the output is fed into an MLP layer with trainable parameters $\phi$ to produce the accumulated label relation $LR(A_l, *)$.
Since we expect the model to predict probabilities, an activation function $\sigma$ is utilized to restrict each output digit to $[0, 1]$. For label $l$, a digit approaching 1 implies a high probability of its presence, while one close to 0 suggests its absence. Fig. 4 presents a visual illustration of the label relational inference module.
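The three implementation steps can be sketched in NumPy as follows. This is a toy illustration rather than the authors' code: the function name `accumulated_relation`, the ReLU activations, the two-layer MLP standing in for $f_\phi$, the sigmoid used as $\sigma$, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accumulated_relation(parcels, l, g_weights, w1, b1, w2, b2):
    """Return sigma(LR(A_l, *)) for label l.

    parcels:   (L, H, W, K) attentional feature parcels.
    g_weights: dict mapping (l, m) to a (2K, D) 1x1 convolution kernel, one per label pair.
    w1, b1, w2, b2: parameters of a two-layer MLP standing in for f_phi (assumed shape).
    """
    num_labels = parcels.shape[0]
    total = 0.0
    for m in range(num_labels):
        if m == l:
            continue
        # Step 1: concatenate A_l and A_m along channels and apply a pair-specific 1x1 conv.
        pair = np.concatenate([parcels[l], parcels[m]], axis=-1)    # (H, W, 2K)
        g = np.maximum(pair @ g_weights[(l, m)], 0.0)               # ReLU is an assumption
        # Step 2: global average pooling, then element-wise accumulation over m.
        total = total + g.mean(axis=(0, 1))                         # (D,)
    # Step 3: MLP on the accumulated vector, squashed to [0, 1].
    hidden = np.maximum(total @ w1 + b1, 0.0)
    return sigmoid(hidden @ w2 + b2)

rng = np.random.default_rng(0)
L, H, W, K, D, HID = 3, 4, 4, 2, 5, 8   # toy sizes, illustrative only
A = rng.standard_normal((L, H, W, K))
g_w = {(l, m): rng.standard_normal((2 * K, D)) for l in range(L) for m in range(L) if l != m}
w1, b1 = rng.standard_normal((D, HID)), np.zeros(HID)
w2, b2 = rng.standard_normal(HID), 0.0
probs = [accumulated_relation(A, l, g_w, w1, b1, w2, b2) for l in range(L)]
print(probs)
```

Because the pairwise outputs are summed rather than passed along a chain, the result for label $l$ does not depend on the order in which the other labels are visited.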
Compared to other multi-label classification methods, our model has three benefits:
1) The module can inherently reason about label relations as indicated by Eq. 3 and requires no particular prior knowledge about relations among all objects. That is to say, our network does not need to learn how to compute label relations and which object relations should be considered. All relations are automatically learned in a data-driven way and are shown to agree with reality in our experiments.
2) The learning effectiveness is independent of long short-term memory, leading to increased robustness. This is because, in Eq. 4, accumulated label relations are calculated with a summation function instead of a chain architecture, e.g., an LSTM.
3) The function $g_{\theta_{lm}}$ is learned for each object pair $l$ and $m$ separately, which means that pairwise label relations are encoded in a pair-specific way. Besides, our implementation of $g_{\theta_{lm}}$ can extend the applicability of relational reasoning compared to using an MLP.

III. EXPERIMENTS AND DISCUSSION
In this section, we conduct experiments on the UCM multi-label dataset [48] and the proposed AID multi-label dataset to evaluate our model. Specifically, Section III-A presents a description of these two datasets. Afterwards, we introduce training strategies and thoroughly discuss experimental results in the subsequent subsections.

A. Dataset Introduction
1) UCM multi-label dataset: The UCM multi-label dataset [48] is reproduced by assigning newly defined object labels to all aerial images collected in the UCM dataset [22]. The number of all candidate object labels is 17: airplane, building, sand, dock, court, tree, sea, bare soil, mobile home, ship, field, tank, water, grass, pavement, chaparral, and car. It is worth noting that labels such as tank, airplane, and building exist in both [22] and [48], but at different levels. In [22], such terms are considered scene-level labels because the related images can be characterized and depicted by them, while in [48], they denote objects that may be present in aerial images.
As to the properties of images in this dataset, the spatial resolution of each sample is one foot, and the size is 256 × 256 pixels. All images are manually cropped from aerial imagery contributed by the National Map of the U.S. Geological Survey (USGS), and there are 2100 images in total. For each object category, the number of images is listed in Table I. Besides, 80% of the image samples per scene class are selected to train our model, and the other 20% of the images are used as test samples. The numbers of images assigned to the training and test sets with respect to all object labels are available in Table I as well. Some visual examples are shown in Fig. 5.
2) AID multi-label dataset: In order to further evaluate our network and meanwhile promote progress in the area of multi-label classification of high-resolution aerial images, we produce a new dataset, named the AID multi-label dataset, based on the widely used AID scene classification dataset [23]. The AID dataset consists of 10000 high-resolution aerial images collected from worldwide Google Earth imagery, including scenes from China, the United States, England, France, Italy, Japan, and Germany. In contrast to the UCM dataset, the spatial resolutions of images in the AID dataset vary from 0.5 m/pixel to 8 m/pixel, and the size of each aerial image is 600 × 600 pixels. Besides, the number of images in each scene category ranges from 220 to 420. Overall, the AID dataset is more challenging than the UCM dataset.
Here, we manually relabel some images in the AID dataset. With extensive human visual inspection, 3000 aerial images from 30 scenes in the AID dataset are selected and assigned multiple object labels, and the distribution of samples in each category is shown in Table II. Besides, 80% of all images are taken as training samples, while the rest are used for testing our model. Several example images are shown in Fig. 6.

B. Training Details
The modules of our network are initialized in different ways. For the label-wise feature parcel learning module, we initialize the backbone with a model pre-trained on ImageNet [55] and the weights in other convolutional layers with a Glorot uniform initializer. Regarding the attentional region extraction module, we initialize the transformation matrix in Eq. 1 as an identity transformation,

$$M_{T_l} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

In the label relational inference module, all weights are initialized with the same strategy as that in the first module. Notably, weights in the backbone are trainable during the training phase.
In our case, multiple labels are encoded into multi-hot binary sequences instead of the one-hot vectors widely used in single-label classification tasks. The length of such a multi-hot binary sequence is identical to the total number of object categories, i.e., 17 in our case, and for each digit, 0 suggests an absent object, while 1 indicates the presence of its corresponding object label. Accordingly, we define the network loss as the binary cross entropy. Besides, Adam with Nesterov momentum [56], which shows faster convergence than stochastic gradient descent (SGD) for our task, is selected as the optimizer, and its parameters are set as recommended [56]: $\epsilon$ = 1e-08, $\beta_1$ = 0.9, and $\beta_2$ = 0.999. The learning rate is initially set to 1e-04 and is decayed by a factor of 0.1 if the validation loss fails to decrease.
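The multi-hot encoding and the binary cross-entropy loss can be illustrated with a small pure-Python sketch. The shortened label list, the helper names `multi_hot` and `binary_cross_entropy`, the averaging over labels, and the clipping constant `eps` are assumptions made for this example only.

```python
import math

# Shortened, hypothetical label list; the datasets in the text use 17 labels.
LABELS = ["bare soil", "building", "car", "pavement", "tree"]

def multi_hot(present, labels=LABELS):
    """Encode a set of present object labels as a multi-hot binary sequence."""
    return [1 if name in present else 0 for name in labels]

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Binary cross entropy averaged over labels (averaging is an assumption)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

target = multi_hot({"car", "pavement"})
predicted = [0.1, 0.2, 0.9, 0.8, 0.3]     # sigmoid outputs of a hypothetical network
print(target)                              # [0, 0, 1, 1, 0]
print(binary_cross_entropy(target, predicted))
```

Unlike a softmax with categorical cross entropy, each label contributes an independent binary term, so several labels can be predicted as present at once.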
Our model is implemented in TensorFlow 1.12.0 and trained for 100 epochs. The computational resource is an NVIDIA Tesla P100 GPU with 16 GB of memory. As a compromise between training speed and GPU memory capacity, we set the training batch size to 32. To avoid overfitting, training is terminated once the validation loss increases continuously for five epochs.
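The stopping rule can be made concrete with a short sketch that replays a recorded sequence of validation losses. The function name `early_stop_epoch` is hypothetical, and interpreting "increases continuously" as five consecutive epoch-to-epoch increases is an assumption.

```python
def early_stop_epoch(val_losses, patience=5, max_epochs=100):
    """Return the 1-based epoch at which training stops.

    val_losses[i] is the validation loss after epoch i + 1. Training stops as soon
    as the loss has increased for `patience` consecutive epochs, or when the
    recorded losses or `max_epochs` are exhausted.
    """
    limit = min(len(val_losses), max_epochs)
    rising = 0
    for i in range(1, limit):
        if val_losses[i] > val_losses[i - 1]:
            rising += 1
        else:
            rising = 0
        if rising >= patience:
            return i + 1
    return limit

losses = [1.0, 0.9, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.7]
print(early_stop_epoch(losses))  # 8: epochs 4 to 8 each increased the loss
```

In this example the loss at epoch 9 would have been lower again, which shows the trade-off of such a rule: a longer patience tolerates more temporary increases at the cost of more training time.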

C. Results on the UCM Multi-label Dataset
To validate the effectiveness of the proposed attention-aware label relational reasoning network (AL-RN-CNN), we compare it with the following competitors: a standard CNN, CNN-RBFNN [31], and CA-CNN-BiLSTM [34]. Taking into account that the CNN is designed to perform single-label classification, we replace its last softmax layer with a sigmoid layer to produce multi-hot sequences. For all models, output sequences are binarized with a threshold of 0.5 to generate final predictions.
1) Quantitative analysis: In our experiments, we employ $F_1$ [57] and $F_2$ [58] scores as evaluation metrics to quantitatively assess the performance of different models. Specifically, these two F scores are calculated with the following equation:

$$F_\beta = (1 + \beta^2) \frac{p_e r_e}{\beta^2 p_e + r_e}, \quad \beta = 1, 2,$$

where $p_e$ and $r_e$ indicate the example-based precision and recall [59] of predictions, respectively. The formulas for calculating $p_e$ and $r_e$ are:

$$p_e = \frac{TP_e}{TP_e + FP_e}, \quad r_e = \frac{TP_e}{TP_e + FN_e},$$

where $TP_e$ (example-based true positives) indicates the number of correctly predicted positive labels in an example, while $FP_e$ (example-based false positives) denotes the number of labels incorrectly predicted as positive. Besides, $FN_e$ (example-based false negatives) represents the number of labels incorrectly predicted as negative, i.e., present labels that fail to be recognized. Here, an example stands for an aerial image and its associated multiple labels.
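To evaluate our network comprehensively, we take the mean $F_1$ and $F_2$ scores as principal indexes. Moreover, we also report the mean $p_e$ and mean $r_e$. In addition to the example-based perspective, label-based precision and recall are also considered and calculated with:

$$p_l = \frac{TP_l}{TP_l + FP_l}, \quad r_l = \frac{TP_l}{TP_l + FN_l},$$

where $TP_l$, $FP_l$, and $FN_l$ are counted per object label rather than per example, to demonstrate the performance of networks from the perspective of each object label.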
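As a worked example of the example-based metrics, the following pure-Python sketch binarizes predicted probabilities with the threshold of 0.5 and computes precision, recall, and the F score for a single example. The helper names, the use of `>=` at the threshold, and returning 0 when a denominator is zero are assumptions of this sketch.

```python
def binarize(probs, threshold=0.5):
    """Turn predicted probabilities into a multi-hot prediction."""
    return [1 if p >= threshold else 0 for p in probs]

def example_scores(y_true, y_pred, beta=1.0):
    """Return example-based (precision, recall, F_beta) for one image."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # zero-division convention assumed
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta

y_true = [1, 1, 1, 0]
y_pred = binarize([0.9, 0.4, 0.2, 0.1])     # -> [1, 0, 0, 0]
p, r, f1 = example_scores(y_true, y_pred, beta=1.0)
_, _, f2 = example_scores(y_true, y_pred, beta=2.0)
print(p, r, f1, f2)   # precision 1, recall 1/3, F1 = 1/2, F2 = 5/13
```

Here one of three present labels is found without any false alarm, so precision is perfect while recall is low; $F_2$ is lower than $F_1$ because it weights recall more heavily. The mean scores reported in the tables would be obtained by averaging such per-example values over the test set.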
Table III exhibits experimental results on the UCM multi-label dataset. We can observe that our model surpasses all competitors on the UCM multi-label dataset with various backbones. Specifically, AL-RN-VGGNet increases the mean $F_1$ and $F_2$ scores by 7.16% and 5.64%, respectively, in comparison with VGGNet. Compared to CA-VGG-BiLSTM, which employs a bidirectional LSTM structure for exploring label dependencies, our network obtains an improvement of 5.92% in the mean $F_1$ score. Besides, although CA-VGG-BiLSTM is superior to VGGNet in both mean $F_1$ and $F_2$ scores, it achieves decreased mean precisions and recalls. In contrast, AL-RN-VGGNet outperforms VGGNet not only in mean $F_1$ and $F_2$ scores but also in mean example- and label-based precisions and recalls. For another backbone, GoogLeNet, our network gains the best mean $F_1$ and $F_2$ scores. As shown in Table III, AL-RN-GoogLeNet increases the mean $F_1$ score by 4.56% and 3.42% with respect to GoogLeNet and CA-GoogLeNet-BiLSTM, respectively. For the mean $F_2$ score and precisions, our model also surpasses the other competitors, which supports the effectiveness and robustness of our method. AL-RN-ResNet achieves the best mean $F_1$ score, 0.8676, and $F_2$ score, 0.8667, in comparison with all other models. Furthermore, it obtains the best mean example-based precision, 0.8881, label-based precision, 0.9233, and recall, 0.8595. To summarize, comparisons between AL-RN-CNN and other models demonstrate the effectiveness of our network. Furthermore, comparisons between AL-RN-CNN and CA-CNN-BiLSTM suggest that explicitly modeling label relations works better than the implicit way of LSTM-based structures. Table IV presents several example predictions from the UCM multi-label dataset.
2) Qualitative analysis: In order to figure out what is going on inside our network, we further visualize features learned by each module and validate the effectiveness of the proposed network in a qualitative manner. In Fig. 7, several feature parcels regarding bare soil, building, car, pavement, court, and tank are displayed for several example images. Note that among the $K$ feature maps in each feature parcel, we select the most strongly activated one as the representative. We can observe that discriminative regions related to positive labels are highlighted in these feature maps, while less informative regions are weakly activated. As an exception, the feature map at the bottom left of Fig. 7 shows that the baseball field is misidentified as tanks, which may lead to incorrect predictions.
To evaluate the localization ability of the proposed network, we visualize attentional regions learned by the second module. Coordinates of the bottom left (BL) and top right (TR) corners of attentional region grids are calculated by applying Eq. 2 to the corresponding corners of the normalized regular grid, $(-1, -1)$ and $(1, 1)$:

$$x_l^{BL} = -s_{x_l} + t_{x_l}, \quad y_l^{BL} = -s_{y_l} + t_{y_l}, \quad x_l^{TR} = s_{x_l} + t_{x_l}, \quad y_l^{TR} = s_{y_l} + t_{y_l}.$$

Fig. 8 shows some examples of learned attentional regions. As we can see, most attentional regions concentrate on areas covering objects of interest. Besides, it is noteworthy that even when objects are dispersed, the learned attentional regions can still cover most of them, e.g., buildings in Fig. 8a and cars in Fig. 8b.
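A small numerical example may help to read such attentional regions. The sketch below applies the scaling and translation of Eq. 2 to the two grid corners; the function name `attention_corners`, the parameter values, and the convention that the bottom left corner of the normalized grid is $(-1, -1)$ are assumptions for illustration.

```python
def attention_corners(sx, sy, tx, ty):
    """Return the bottom left and top right corners of the attentional region grid.

    Coordinates are normalized to [-1, 1]; the corners of the regular grid are
    assumed to be (-1, -1) and (1, 1).
    """
    bottom_left = (-sx + tx, -sy + ty)
    top_right = (sx + tx, sy + ty)
    return bottom_left, top_right

# Hypothetical learned parameters: half-size region shifted right and slightly down.
bl, tr = attention_corners(sx=0.5, sy=0.5, tx=0.2, ty=-0.1)
print(bl, tr)   # approximately (-0.3, -0.6) and (0.7, 0.4)
```

The width and height of the region in normalized units are $2 s_{x_l}$ and $2 s_{y_l}$, so scaling factors below 1 correspond to attentional regions smaller than the full feature parcel.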
Furthermore, learned pairwise label relations are visualized in matrix form, where the element at $(l, m)$ indicates $LR(A_l, A_m)$. Fig. 9 exhibits some examples for the four scenes in Fig. 8. In these examples, we take only positive object labels into consideration and perform normalization along each row to yield a distinct visualization of "label relations". Since $m$ differs from $l$, we assign null values to diagonal elements and mark them in white in Fig. 9.
It can be seen that in Fig. 9a and 9b, relations between car and pavement contribute significantly to predicting the presence of both car and pavement. Besides, Fig. 9d shows that the existence of tree strongly suggests the presence of bare soil, but not vice versa. These observations illustrate that even without prior knowledge, the proposed network can reason about relations that are in line with reality.

D. Results on the AID Multi-label Dataset
1) Quantitative analysis: To further evaluate the proposed network, we report experimental results on the AID multi-label dataset. The evaluation metrics here are the same as those in the previous experiments, and results are presented in Table V. As we can observe, the proposed AL-RN-CNN is superior to all competitors in most of the metrics. To be more specific, AL-RN-VGGNet improves the mean $F_1$ and $F_2$ scores by 2.57% and 2.71%, respectively, compared to the baseline model. In comparison with CA-VGG-BiLSTM, our network gains an improvement of 1.41% in the mean $F_1$ score and 1.43% in the mean $F_2$ score. Regarding the other two backbones, similar phenomena can be observed as well. AL-RN-GoogLeNet achieves the highest mean $F_1$ and $F_2$ scores, 0.8817 and 0.8825, compared to GoogLeNet and CA-GoogLeNet-BiLSTM, while AL-RN-ResNet surpasses the second best model by 1.09% and 0.51% in the mean $F_1$ and $F_2$ scores, respectively. Besides, it is noteworthy that although CA-GoogLeNet-BiLSTM shows decreased performance compared to the baseline model, our network still achieves higher scores in all metrics. Moreover, we notice that the proposed AL-RN-CNNs outperform baseline CNNs by a large margin in the mean label-based recall, and the maximum improvement reaches 18.30%. In conclusion, these comparisons suggest that explicitly modeling label relations can improve the robustness and retrieval ability of a network. Several example predictions on the AID multi-label dataset are presented in Table IV.
2) Qualitative analysis: To examine the model more closely, we visualize label-specific features and attentional regions in Figs. 10 and 11, respectively. In Fig. 10, representative feature maps in various feature parcels for bare soil, building, car, pavement, tree, and water are displayed. As shown here, regions with label-related semantics are highlighted, while less informative regions present weak activations. For instance, regions of ponds are considered discriminative regions for identifying water. Residential and industrial areas are strongly activated in feature maps for recognizing building. In Fig. 11, it can be observed that attentional regions learned by our network are able to capture areas of semantic objects, such as cars and trees. We also note that some attentional regions in Fig. 11 are coarser than those in Fig. 8, which is because the AID multi-label dataset has a lower spatial resolution. Furthermore, pairwise relations among positive labels are visualized in Fig. 12. As shown in Fig. 12b, 12c, and 12d, the existence of both tree and pavement contributes significantly to the identification of car, while the occurrence of car only suggests a high probability that pavement is present. Strong pairwise relations between building and other labels, e.g., car, pavement, and tree, indicate that the presence of building can heavily assist in predicting those labels.

E. Discussion on the Relational Inference Module
Regarding the relational inference module, the function g_{θ_lm} is an important component, which reasons about relations between two objects. Hence, in this subsection, we discuss different implementations of g_{θ_lm}. Specifically, we compare our AL-RN-CNN with LR-CNN [61], which employs a global average pooling layer and an MLP as g_{θ_lm}, on both the UCM and AID multi-label datasets. Experimental results are reported in Table VI. As shown in this table, our network gains the best mean F_1 and F_2 scores on both datasets with various backbones. AL-RN-VGGNet achieves the highest improvements of 3.59% and 3.82% for the mean F_1 and F_2 scores, respectively, compared to LR-VGGNet on the UCM multi-label dataset. AL-RN-GoogLeNet increases the mean F_1 and F_2 scores by 3.25% and 1.28%, respectively, in comparison with LR-ResNet on the AID multi-label dataset. Moreover, AL-RN-CNN can encode label relations through various fields of view by simply changing the size of convolutional filters in g_{θ_lm}.

IV. CONCLUSION
In this work, we propose a novel aerial image multi-label classification network, namely the attention-aware label relational reasoning network. This network comprises three components: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module. To be more specific, the label-wise feature parcel learning module is designed to learn high-level feature parcels, which are shown to encompass label-relevant semantics, and the attentional region extraction module further generates finer attentional feature parcels by preserving only features located in discriminative regions. Afterwards, the label relational inference module reasons about pairwise relations among all labels and exploits these relations for the final prediction. In order to assess the performance of our network, experiments are conducted on the UCM multi-label dataset and a newly proposed AID multi-label dataset. In comparison with other deep learning methods, our network offers better classification results. In addition, we visualize extracted feature parcels, attentional regions, and relation matrices to demonstrate the effectiveness of each module in a qualitative way. Looking into the future, such a network architecture has potential for several applications, e.g., weakly supervised object detection and semantic segmentation.

Fig. 1: Example aerial images of the scene river and the objects present in them. (a) bare soil, grass, tree, and water. (b) water, bare soil, and tree. (c) water, building, grass, car, tree, pavement, and bare soil. (d) water, building, grass, bare soil, tree, and sand.

Fig. 2: The architecture of the proposed attention-aware label relational reasoning network.

Fig. 3: Illustration of the attentional region extraction module. Green dots in the left image indicate the feature parcel grid G_{X_l}. White dots in the middle image represent the attentional feature parcel grid G_{X_l^attn}, while those in the right image indicate the re-coordinated G_{X_l^attn}. Notably, the structure of the re-coordinated G_{X_l^attn} is identical to that of G_{X_l}, and values of pixels located at grid points in re-coordinated G_{X_l^attn}

Fig. 5: Samples of various scene categories in the UCM multi-label dataset as well as associated object labels. The spatial resolution of each image is one foot, and the size is 256 × 256 pixels. Scene and object labels of each sample are as follows: (a) Tennis court: tree, grass, court, and bare soil. (b) Overpass: pavement, bare soil, and car. (c) Mobile home park: pavement, grass, bare soil, tree, mobile home, and car. (d) Storage tank: tank, pavement, and bare soil. (e) Runway: pavement and grass. (f) Intersection: car, tree, pavement, grass, and building. (g) River: water, tree, and grass. (h) Medium residential: pavement, grass, car, tree, and building. (i) Harbor: ship, water, and dock. (j) Sparse residential: car, tree, grass, pavement, building, and bare soil. (k) Golf course: sand, pavement, tree, and grass. (l) Beach: sea and sand. (m) Forest: tree, grass, and building. (n) Baseball diamond: pavement, grass, building, and bare soil. (o) Airplane: airplane, car, bare soil, grass, and pavement. (p) Dense residential: tree, building, pavement, grass, and car. (q) Parking lot: pavement, grass, and car. (r) Building: pavement, car, and building. (s) Freeway: tree, car, pavement, grass, and bare soil. (t) Chaparral: chaparral and bare soil. (u) Agricultural: tree and field.

TABLE I: The number of images for different object categories in the UCM multi-label dataset.

TABLE II: The number of images for different object categories in the AID multi-label dataset.

TABLE III: Comparisons of the classification performance on the UCM multi-label dataset (%).

TABLE IV: Example images and predicted labels on the UCM and AID multi-label datasets. Red predictions indicate false positives, while blue predictions are false negatives.

TABLE V: Comparisons of the classification performance on the AID multi-label dataset (%).

TABLE VI: Comparison between different g_{θ_lm} (%). Columns: g_{θ_lm}, V* F_1, G* F_1, R* F_1, V* F_2, G* F_2, R* F_2. V* F_1, G* F_1, and R* F_1 indicate the mean F_1 score achieved by VGGNet-, GoogLeNet-, and ResNet-based networks. V* F_2, G* F_2, and R* F_2 indicate the mean F_2 score achieved by VGGNet-, GoogLeNet-, and ResNet-based networks.
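The two families of g_{θ_lm} compared in Table VI can be contrasted with a minimal NumPy sketch. It is an illustration under loud assumptions rather than either network's actual configuration: the function names, layer counts, widths, random weights, and the choice to concatenate the two parcels along the channel axis are all hypothetical. The first function pools each feature parcel globally before an MLP, as described for LR-CNN; the second applies a k × k convolution to the paired parcels, so that changing k changes the field of view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def g_gap_mlp(a_l, a_m, w1, w2):
    """GAP + MLP relation function: spatial layout is discarded by pooling."""
    pooled = np.concatenate([a_l.mean(axis=(1, 2)), a_m.mean(axis=(1, 2))])
    return relu(w2 @ relu(w1 @ pooled))

def g_conv(a_l, a_m, kernel):
    """Convolutional relation function (assumed form, 'valid' padding).

    kernel: (D, 2C, k, k); the filter size k sets the field of view.
    """
    pair = np.concatenate([a_l, a_m], axis=0)                  # (2C, H, W)
    k = kernel.shape[-1]
    windows = sliding_window_view(pair, (k, k), axis=(1, 2))   # (2C, H-k+1, W-k+1, k, k)
    fmap = relu(np.einsum("chwij,ocij->ohw", windows, kernel))
    return fmap.mean(axis=(1, 2))                              # (D,) relation vector

# Hypothetical sizes: C channels per parcel, H x W grid, D relation features.
C, H, W, D = 4, 8, 8, 6
a_l = rng.normal(size=(C, H, W))
a_m = rng.normal(size=(C, H, W))
w1, w2 = rng.normal(size=(16, 2 * C)), rng.normal(size=(D, 16))
k1 = rng.normal(size=(D, 2 * C, 1, 1))
k3 = rng.normal(size=(D, 2 * C, 3, 3))

r_mlp = g_gap_mlp(a_l, a_m, w1, w2)
r_1x1 = g_conv(a_l, a_m, k1)   # relates co-located positions only
r_3x3 = g_conv(a_l, a_m, k3)   # relates 3 x 3 neighbourhoods
```

In this sketch, swapping the argument order changes the result, which is consistent with relations that hold in one direction but not the other.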