All Grains, One Scheme (AGOS): Learning Multi-grain Instance Representation for Aerial Scene Classification

Aerial scene classification remains challenging because: 1) the size of the key objects that determine the scene scheme varies greatly; 2) many objects irrelevant to the scene scheme often flood the image. Hence, how to effectively perceive regions of interest (RoIs) of varied sizes and build a more discriminative representation from such a complicated object distribution is vital to understanding an aerial scene. In this paper, we propose a novel all grains, one scheme (AGOS) framework to tackle these challenges. To the best of our knowledge, it is the first work to extend classic multiple instance learning (MIL) into a multi-grain formulation. Specifically, it consists of a multi-grain perception (MGP) module, a multi-branch multi-instance representation (MBMIR) module and a self-aligned semantic fusion (SSF) module. First, our MGP preserves the differential dilated convolutional features from the backbone, which magnifies the discriminative information across grains. Then, our MBMIR highlights the key instances in the multi-grain representation under the MIL formulation. Finally, our SSF allows our framework to learn the same scene scheme from the multi-grain instance representations and fuses them, so that the entire framework is optimized as a whole. Notably, our AGOS is flexible and can be easily adapted to existing CNNs in a plug-and-play manner. Extensive experiments on the UCM, AID and NWPU benchmarks demonstrate that our AGOS achieves performance comparable to the state-of-the-art methods.

I. INTRODUCTION

Aerial scene classification remains challenging due to some unique characteristics of aerial imagery:

Fig. 1: (a) & (b): Comparison between ground and aerial images in terms of average object size and average object quantity (original statistics quoted from [16]); objects in aerial images are much more varied in size and each aerial image usually contains many more objects. (c) & (d): Examples of the dramatically varied object sizes and large object quantities in aerial images.
1) More varied object sizes in aerial images. As both the spatial resolution and the viewpoint of the sensor vary greatly in aerial imaging [1], [17], [18], object sizes under the bird's-eye view are usually more varied than in ground images. Specifically, objects in ground images are usually middle-sized. In contrast, aerial images contain many more small objects, while some objects such as airports and roundabouts are extremely large. As a result, the average object size in aerial images is much larger than in ground images (shown in Fig. 1 (a) & (c)).
Thus, it is difficult for existing convolutional neural networks (CNNs) with a fixed receptive field to fully perceive the scene scheme of an aerial image due to the more varied sizes of the key objects [1], [5], [19]-[21], which degrades a model's capability to understand aerial scenes.
2) More crowded object distribution in aerial images. Due to the bird's-eye view of imaging platforms such as unmanned aerial vehicles and satellites, aerial images are usually large-scale and thus contain many more objects than ground images [1], [2], [22] (see Fig. 1 (b) & (d) for an example).
Unfortunately, existing CNNs are capable of preserving the global semantics [11]-[13] but are unqualified to highlight the key local regions [23], [24], i.e., regions of interest (RoIs), of a scene with a complicated object distribution. Therefore, CNNs are likely to be affected by local semantic information irrelevant to the scene label and fail to predict the correct scene scheme [2], [25]-[28] (see Fig. 2 for an intuitive illustration).

B. Motivation & Objectives
We are motivated to tackle the above challenges in aerial scene classification, hoping to build a more discriminative aerial scene representation. Specific objectives include: 1) Highlighting the key local regions in aerial scenes. Existing deep learning models require dedicated effort to highlight the key local regions of an aerial scene, so as to correctly perceive the scene scheme rather than activate the background or other irrelevant local regions.
Therefore, the formulation of classic multiple instance learning (MIL) [29], [30] is adapted in our work to describe the relation between the aerial scene (bag) and the local image patches (instances). This formulation helps highlight the feature responses of key local regions, and thus enhances the understanding capability for the aerial scene.
2) Aligning the same scene scheme across the multi-grain representation. Given the varied object sizes in an aerial scene, it is natural to use existing multi-scale convolutional features [18]-[21] for a more discriminative aerial scene representation. However, given the aforementioned complicated object distribution in aerial scenes, whether the representation of each scale learnt by existing multi-scale solutions can focus on the scene scheme remains an open question, yet it is crucial for depicting aerial scenes.
Hence, different from existing multi-scale solutions [31], we extend the classic MIL formulation to a multi-grain manner under the existing deep learning pipeline, in which a set of instance representations are built from multi-grain convolutional features. More importantly, in the semantic fusion stage, we develop a simple yet effective strategy to align the instance representation from each grain to the same scene scheme.

C. Contribution
To realize the above objectives, the contributions of this paper can be summarized as follows. (1) We propose an all grains, one scheme (AGOS) framework for aerial scene classification. To the best of our knowledge, we are the first to formulate classic MIL in a deep multi-grain form. Notably, our framework can be adapted to existing CNNs in a plug-and-play manner. (2) We propose a bag scheme self-alignment strategy, which allows the instance representation from each grain to highlight the key instances corresponding to the bag scheme without additional supervision. Technically, it is realized by our self-aligned semantic fusion (SSF) module and the semantic-aligning loss function. (3) We propose a multi-grain perception (MGP) module for multi-grain convolutional feature extraction. Technically, the absolute difference between each two adjacent grains generates a more discriminative aerial scene representation. (4) Extensive experiments not only validate the state-of-the-art performance of our AGOS on three aerial scene classification benchmarks, but also demonstrate the generalization capability of our AGOS on a variety of CNN backbones and two other classification domains.

This paper is an extension of our conference paper accepted by ICASSP 2021 [32]. Compared with [32], the specific improvements of this paper include: 1) the newly-designed bag scheme self-alignment strategy, realized by our SSF module and the corresponding loss function, which is capable of aligning the bag scheme to the instance representation of each grain; 2) the multi-grain perception module, which additionally learns the base instance representation, to align the bag scheme and to highlight the key local regions in aerial scenes; 3) empirically, our AGOS demonstrates superior performance against our initial version [32]. In addition, more experiments, discussion and visualization are provided to analyze the insight of our AGOS.
The remainder of this paper is organized as follows. In Section II, related work is provided. In Section III, the proposed method is demonstrated. In Section IV, we report and discuss the experiments on three aerial image scene classification benchmarks. Finally in Section V, the conclusion is drawn.

II. RELATED WORK

A. Aerial scene classification
Aerial scene classification remains an active research topic for both the computer vision and remote sensing communities. In terms of the utilized features, existing solutions are usually divided into low-level (e.g., color histogram [33], wavelet transformation [34], local binary patterns [35], [36], etc.), middle-level (e.g., bag of visual words [37], probabilistic latent semantic analysis [38], [39], latent Dirichlet allocation [40], etc.) and high-level feature based methods.
High-level feature methods, also known as deep learning methods, have become the dominant paradigm for aerial scene classification in the past decade. Major reasons accounting for its popularity include their stronger feature representation capability and end-to-end learning manner [41], [42].
Meanwhile, although vision transformers (ViT) [56]-[58] have recently been reported to achieve high classification performance on remote sensing scenes, they focus more on global semantic information through the self-attention mechanism, whereas our motivation focuses more on local semantic representation and the activation of regions of interest (RoIs). Also, the combination of multiple instance learning and deep learning is currently based on CNN pipelines [2], [23], [59]-[61]. Hence, the discussion and comparison of ViT based methods are beyond the scope of this work.
To sum up, as the global semantic representation of CNNs is still not capable of fully depicting the complexity of aerial scenes with complicated object distributions [2], [25], how to properly highlight the regions of interest (RoIs) against the complicated background of aerial images to enhance the scene representation capability remains rarely explored.

B. Multi-scale feature representation
Multi-scale convolutional feature representation has been long investigated in the computer vision community [62], [63]. As the object sizes are usually more varied in aerial scenes, multi-scale convolutional feature representation has also been widely utilized in the remote sensing community for a better understanding of aerial images.
Till now, multi-scale feature representation for aerial images can be classified into two categories, that is, using multi-level CNN features in a non-trainable manner and directly extracting multi-scale CNN features in the deep learning pipeline.
For the first category, the basic idea is to derive multi-layer convolutional features from a pre-trained CNN model, and then feed these features into a non-trainable encoder such as BoW or LDA. Typical works include [19], [21], [43]. Although the motivation of such approaches is to learn more discriminative scene representation in the latent space, they are not end-to-end and the performance gain is usually marginal.
For the second category, the basic idea is to design spatial pyramid pooling [20], [45] or image pyramids [18] to extend the convolutional features into a multi-scale representation. Generally, such multi-scale solutions can be further divided into four categories [31], namely, encoder-decoder pyramid, spatial pyramid pooling, image pyramid and parallel pyramid. Although multi-scale representation methods have become mature, whether the representation from each scale can effectively activate the RoIs in the scene has not been explored.

C. Multiple instance learning
Multiple instance learning (MIL) was initially designed for drug prediction [29] and then became an important machine learning tool [30]. In MIL, an object is regarded as a bag, and a bag consists of a set of instances [64]. Generally speaking, there is no specific instance label and each instance can only be judged as either belonging or not belonging to the bag category. This formulation makes MIL especially qualified to learn from the weakly-annotated data [61], [65], [66].
On the other hand, the classic MIL theory has also been enriched. Specifically, Sivan et al. [76] relaxed the Boolean OR assumption in the MIL formulation, so that the relation between bag and instances becomes more general. More recently, Alessandro et al. [77] investigated three-level multiple instance learning, in which the three hierarchical levels are organized vertically as top-bag, sub-bag and instance, the sub-bag being an embedding between the top-bag and the instances. Note that our deep MIL under the multi-grain formulation is quite distinct from [77], as our formulation still has two hierarchical levels, i.e., bag and instances, and the instance representation is generated from multi-grain features.
In the past few years, deep MIL has drawn increasing attention, as MIL tends to be combined with deep learning in a trainable manner. Specifically, Wang et al. utilized either max pooling or mean pooling to aggregate the instance representations in neural networks [61]. Later, Ilse et al. [23] used a gated attention module to generate weights for aggregating the instance scores. Bi et al. [2] utilized both a spatial attention module and a channel-spatial attention module to derive the weights and directly aggregate the instance scores into a bag-level probability distribution. More recently, Shi et al. [59], [60] embedded the attention weights into the loss function so as to guide the learning process of deep MIL.

III. THE PROPOSED METHOD
A. Preliminary

1) Classic & deep MIL formulation: For our aerial scene classification task, according to the classic MIL formulation [29], [30], a scene $X$ is regarded as a bag, and the bag label $Y$ is the scene category of this scene. As each bag $X$ consists of a set of instances $\{x_1, x_2, \cdots, x_l\}$, each image patch of the scene is regarded as an instance.
All the instances indeed have labels $y_1, y_2, \cdots, y_l$, but these instance labels are weakly annotated, i.e., we only know that each instance either belongs to (denoted as 1) or does not belong to (denoted as 0) the bag category. Then, whether or not a bag belongs to a specific category $c$ is determined via

$$Y = \begin{cases} 1, & \text{if } \exists\, y_s = 1, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$

In deep MIL, as the feature responses from gradient propagation are continuous, the bag probability prediction $Y$ is assumed to be continuous in $[0, 1]$ [2], [23]. The bag is determined to belong to a specific category $c$ via

$$c = \arg\max_{c' \in \{1, \cdots, C\}} p_{c'}, \qquad (2)$$

where $p_1, p_2, \cdots, p_c, \cdots, p_C$ denote the bag probability predictions over all $C$ bag categories.
2) MIL decomposition: In both classic MIL and deep MIL, the transition from the instances $\{x_s\}$ (where $s = 1, 2, \cdots, l$) to the bag label $Y$ can be presented as

$$Y = h\big(g\big(f(\{x_s\})\big)\big), \qquad (3)$$

where $f$ denotes a transformation that converts the instance set into an instance representation, $g$ denotes the MIL aggregation function, and $h$ denotes a transformation that yields the bag probability distribution.
3) Instance space paradigm: The combination of MIL and deep learning is usually conducted in either the instance space [2], [60], [61] or the embedding space [23]. Embedding space based solutions offer a latent space between the instance representation and the bag representation, but this latent space can sometimes be less precise in depicting the relation between instance and bag representation [2], [23]. In contrast, the instance space paradigm has the advantage of generating the bag probability distribution directly from the instance representation [2], [61]. Thus, the $h$ transformation in Eq. 3 becomes an identity mapping, and Eq. 3 is rewritten as

$$Y = g\big(f(\{x_s\})\big). \qquad (4)$$

4) Problem formulation: As we extend MIL into a multi-grain form, the transformation function $f$ in Eq. 4 is extended to a set of transformations $\{f_t\}$ (where $t = 1, 2, \cdots, T$). Then, $Y$ is generated from all these grains and Eq. 4 becomes

$$Y = g\big(f_1(\{x_s\}), f_2(\{x_s\}), \cdots, f_T(\{x_s\})\big). \qquad (5)$$

Hence, how to design a proper and effective transformation set $\{f_t\}$ and the corresponding MIL aggregation function $g$ under the existing deep learning pipeline is our major task.

5) Objective: Our objective is to classify the input scene $X$ in the deep learning pipeline under the formulation of multi-grain multiple instance learning. To summarize, the overall objective function can be presented as

$$\min_{\mathbf{W}, \mathbf{b}} \; L\big(Y, Y_c\big) + \Psi(\mathbf{W}, \mathbf{b}), \qquad (6)$$

where $\mathbf{W}$ and $\mathbf{b}$ are the weight and bias matrices used to train the entire framework, $L$ is the loss function and $\Psi$ is the regularization term.
Moreover, how the instance representation of each grain $f_t(\{x_s\})$ is aligned to the same bag scheme is also taken into account in the instance aggregation $g$ and the optimization $L$, which can be generally presented as

$$\arg\max_{c'} \; g\big(f_t(\{x_s\})\big) = Y_c, \quad \forall\, t = 1, 2, \cdots, T, \qquad (7)$$

where $Y_c$ denotes the category that the bag belongs to.
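To make the instance-space formulation above concrete, the following toy sketch (hypothetical shapes and values, not the authors' code) shows how per-instance class scores are aggregated by a permutation-invariant mean into a bag prediction, i.e., $g$ = mean pooling and $h$ = identity as in Eq. 4:

```python
import numpy as np

def bag_prediction(instance_scores: np.ndarray) -> np.ndarray:
    """Instance-space MIL: aggregate per-instance class scores (l x C) into a
    C-dimensional bag probability distribution (g = mean, h = identity)."""
    bag_logits = instance_scores.mean(axis=0)     # permutation-invariant aggregation
    exp = np.exp(bag_logits - bag_logits.max())   # softmax normalization
    return exp / exp.sum()

# toy bag: l = 6 instances, C = 4 bag categories (illustrative values only)
scores = np.random.randn(6, 4)
probs = bag_prediction(scores)
print(probs.argmax())                             # predicted bag category c, cf. Eq. 2
```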

B. Network overview
As shown in Fig. 3, our proposed all grains, one scheme (AGOS) framework consists of three components after the CNN backbone. To be specific, the multi-grain perception module (Sec. III-C) implements our proposed differential dilated convolution on the convolutional features so as to obtain a discriminative multi-grain representation. Then, the multi-grain feature representation is fed into our multi-branch multi-instance representation module (Sec. III-D), which converts the above features into instance representations and then directly generates the bag-level probability distribution. As aligning the instance representation from each grain to the same bag scheme is another important objective, we propose a bag scheme self-alignment strategy, which is technically fulfilled by our self-aligned semantic fusion module (Sec. III-E) and the corresponding loss function (Sec. III-F). In this way, the entire framework is trained in an end-to-end manner.
C. Multi-grain Perception Module

1) Motivation: Our multi-grain perception (MGP) module intends to convert the convolutional feature from the backbone into multi-grain representations. Different from existing multi-scale strategies [18]-[21], our module builds same-sized feature maps by perceiving multi-grain representations from the same convolutional feature. Then, the absolute difference of the representations from each two adjacent grains is calculated to highlight the differences across grains for a more discriminative representation (shown in Fig. 4).
2) Dilated convolution: Dilated convolution is capable of perceiving feature responses from different receptive fields while keeping the same spatial size [78]. Thus, it has been widely utilized in many visual tasks in the past few years.
Generally, the dilation rate $r$ is the parameter that controls the window size of a dilated convolution filter. For a $3 \times 3$ convolution filter, a dilation rate $r$ means that $r - 1$ zero-valued elements are inserted between each two adjacent elements of the filter, so that the original $3 \times 3$ filter is expanded to an effective size of $(2r + 1) \times (2r + 1)$. Specifically, when $r = 1$, no zeros are inserted and the dilated convolution filter degrades into the traditional convolution filter.
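As a quick check of this size rule, a small helper (illustrative only) computes the effective window of a 3 × 3 filter under the stated convention, recovering $(2r+1) \times (2r+1)$:

```python
def effective_kernel_size(k: int = 3, r: int = 1) -> int:
    """Effective window of a k x k filter with dilation rate r,
    where (r - 1) zeros are inserted between adjacent filter taps."""
    return k + (k - 1) * (r - 1)

for r in (1, 3, 5):
    print(r, effective_kernel_size(3, r))   # -> 3, 7, 11, i.e. (2r + 1) for k = 3
```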
3) Multi-grain dilated convolution: Let the convolutional feature from the backbone be denoted as $X_1$. Assume there are $T$ grains in our MGP; then $T$ dilated convolution filters, denoted as $D_1, D_2, \cdots, D_T$ respectively, are applied to the input $X_1$. The set of multi-grain dilated convolutional features generated from $X_1$ can be presented as

$$\{D_1(X_1), D_2(X_1), \cdots, D_T(X_1)\}, \qquad (8)$$

where the dilation rate of $D_t$ is

$$r_t = 2t - 1, \quad t = 1, 2, \cdots, T. \qquad (9)$$

The determination of the dilation rate for the multi-grain dilated convolution set $\{D_t\}$ follows the existing rule [78] that $r$ is set to an odd value, i.e., $r = 1, 3, 5, \cdots$.
4) Differential dilated convolution: To reduce the feature redundancy across grains while stressing the discriminative features that each grain contains, the absolute difference of each two adjacent representations is calculated via

$$X_{d,t} = \big|\, D_t(X_1) - D_{t-1}(X_1) \,\big|, \quad t = 1, 2, \cdots, T, \qquad (10)$$

where $|\cdot|$ denotes the absolute difference and $X_{d,t}$ denotes the resulting differential dilated convolutional feature representation. It is worth noting that when $t = 1$, $D_0(X_1)$ means the dilated convolution degrades to a conventional convolution. Finally, the output of this MGP module is a set of convolutional feature representations

$$\mathcal{X} = \{X_{d,0}, X_{d,1}, \cdots, X_{d,T}\}, \qquad (11)$$

where $X_{d,0}$ denotes the base representation in our bag scheme self-alignment strategy, whose function will be discussed in detail in the next two subsections. Generally, $X_{d,0}$ is a direct refinement of the input $X_1$ in the hope of highlighting the key local regions. The realization of this objective is straightforward, as the $1 \times 1$ convolutional layer has recently been reported to be effective in refining the feature map and highlighting the key local regions [2], [10]. This process is presented as

$$X_{d,0} = W_{d,0} \ast X_1 + b_{d,0}, \qquad (12)$$

where $W_{d,0}$ and $b_{d,0}$ denote the weight and bias matrices of this $1 \times 1$ convolutional layer, and $W$ and $H$ denote the width and height of the feature representation $X_1$. Moreover, as the channel number $C_1$ of $X_{d,0}$ is kept the same as that of $X_1$, the number of convolutional filters in this layer also equals $C_1$.

5) Summary: As shown in Fig. 4 and depicted in Eqs. 8 to 12, in our MGP the input convolutional features are processed by a series of dilated convolutions with different dilation rates. Then, the absolute difference of each representation pair from two adjacent grains (i.e., $r = 1$ and $r = 3$, $r = 3$ and $r = 5$) is calculated as output, so as to generate the multi-grain differential convolutional features for a more discriminative representation.
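The following TensorFlow/Keras sketch illustrates the MGP computation of Eqs. 8-12; the function name, channel count and grain number are illustrative assumptions rather than the authors' released implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_grain_perception(x, num_grains=3, channels=256):
    """Sketch of MGP: a 1x1 base refinement (Eq. 12), dilated 3x3 convolutions
    with rates r = 2t - 1 (Eqs. 8-9), and absolute differences between
    adjacent grains (Eq. 10); x is a backbone feature map (batch, H, W, C1)."""
    base = layers.Conv2D(channels, 1, padding="same")(x)          # X_{d,0}
    grains = [layers.Conv2D(channels, 3, padding="same")(x)]      # D_0: plain conv
    for t in range(1, num_grains + 1):
        r = 2 * t - 1                                             # r = 1, 3, 5, ...
        grains.append(layers.Conv2D(channels, 3, dilation_rate=r, padding="same")(x))
    diffs = [tf.abs(grains[t] - grains[t - 1]) for t in range(1, num_grains + 1)]
    return [base] + diffs                                         # {X_{d,0}, ..., X_{d,T}}
```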
D. Multi-branch Multi-instance Representation Module

1) Motivation: The convolutional feature representations in $\mathcal{X}$ from different grains contain different discriminative information for depicting the scene scheme. Hence, for the representation $X_{d,t}$ of each grain ($t = 1, 2, \cdots, T$), a deep MIL module is utilized to highlight the key local regions. Specifically, each module converts the convolutional representation into an instance representation, and then utilizes an aggregation function to obtain the bag probability distribution. All these parallel modules are organized as a whole to form our multi-branch multi-instance representation (MBMIR) module.
2) Instance representation transformation: Each convolutional representation $X_{d,t}$ (where $t = 0, 1, \cdots, T$) in the set $\mathcal{X}$ first needs to be converted into an instance representation by a transformation, which is exactly the $f$ function in Eqs. 3 and 4. Specifically, for $X_{d,t}$, this transformation can be presented as

$$I_t = W_{d,t} \ast X_{d,t} + b_{d,t}, \qquad (13)$$

where $I_t$ is the corresponding instance representation, $W_{d,t}$ is the weight matrix of this $1 \times 1$ convolutional layer, $b_{d,t}$ is its bias matrix, and $t = 0, 1, 2, \cdots, T$. Regarding the channel number, assume there are $C$ bag categories in total; then the instance representation $I_t$ also has $C$ channels, so that the feature map of each channel corresponds to the response on a specific bag category, as suggested in Eq. 2. Thus, the number of $1 \times 1$ convolution filters in this layer is also $C$.
Apparently, each 1 × 1 image patch on the W × H sized feature map corresponds to an instance. As there are C bag categories and the instance representation also has C channels, each instance corresponds to a C-dimensional feature vector and thus each dimension corresponds to the feature response on the specific bag category (demonstrated in Fig. 5).
3) Multi-grain instance representation: After being processed by Eq. 13, each differential dilated convolutional feature representation $X_{d,t}$ generates an instance representation $I_t$ at the corresponding grain. Generally, the set of multi-grain instance representations $\{I_t\}$ can be presented as $\{I_0, I_1, \cdots, I_T\}$.

Fig. 5: Illustration of the instance representation and the generation of the bag probability distribution.

4) MIL aggregation function: As presented in Eq. 4, under the instance space paradigm the MIL aggregation function $g$ converts the instance representation directly into the bag probability distribution. In addition, the MIL aggregation function is required to be permutation-invariant [29], [30], so that the bag scheme prediction is invariant to changes in instance positions. Therefore, we utilize mean-based MIL pooling for aggregation.
Specifically, for the instance representation $I_t$ of each grain, denote each instance as $I_t^{w,h}$, where $1 \le w \le W$ and $1 \le h \le H$. Then, the bag probability distribution $Y_t$ of this grain is generated as

$$Y_t = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} I_t^{w,h}. \qquad (14)$$

Apparently, after aggregation, $Y_t$ can be regarded as a $C$-dimensional feature vector. This process can be technically realized by a global average pooling (GAP) function in existing deep learning frameworks.

5) Bag probability generation: The final bag probability distribution $Y$ is the sum of the predictions from all grains, calculated as

$$Y = \mathrm{softmax}\Big(\sum_{t=0}^{T} Y_t\Big), \qquad (15)$$

where $\mathrm{softmax}$ is the softmax function for normalization. To sum up, the pseudo code of all the above steps on learning the multi-branch multi-instance representation is summarized in Algorithm 1, in which conv1d refers to the $1 \times 1$ convolution layers in Eqs. 12 and 13.
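Since the numbered steps of Algorithm 1 are only partially preserved here, the following sketch (continuing the imports and assumptions of the MGP sketch above) reconstructs its flow from Eqs. 13-15:

```python
def multi_branch_mil(diff_features, num_classes):
    """Sketch of MBMIR (cf. Algorithm 1): map each grain's feature map to a
    C-channel instance representation with a 1x1 conv (Eq. 13), aggregate it
    by global average pooling into a per-grain bag prediction (Eq. 14), and
    sum and softmax-normalize the per-grain predictions (Eq. 15)."""
    instance_reps, per_grain_preds = [], []
    for feat in diff_features:                                        # X_{d,0..T}
        inst = layers.Conv2D(num_classes, 1, padding="same")(feat)    # I_t
        instance_reps.append(inst)
        per_grain_preds.append(layers.GlobalAveragePooling2D()(inst)) # Y_t
    bag_prob = tf.nn.softmax(tf.add_n(per_grain_preds))               # Y
    return bag_prob, instance_reps
```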
Algorithm 1 (Learning Multi-branch Multi-instance Representation). Input: convolutional feature $X_1$, grain number $T$. Output: bag probability distribution $Y$ and instance representation set $\{I_t\}$. The steps follow Eqs. 13-15: each $X_{d,t}$ is passed through a conv1d layer to obtain $I_t$, the per-grain predictions $Y_t$ are obtained by GAP, and their softmax-normalized sum gives $Y$.

E. Self-aligned Semantic Fusion Module

1) Motivation: To make the instance representations from different grains focus on the same bag scheme, we propose a bag scheme self-alignment strategy. Specifically, it first finds the difference between a base instance representation and the instance representations from the other grains, and then minimizes this difference with our semantic-aligning loss function. Fig. 6 offers an intuitive illustration of this module.
2) Base representation: The instance representation $I_0$, processed only by a $1 \times 1$ convolutional layer rather than any dilated convolution, is selected as our base representation. One of the major reasons for using $I_0$ as the base representation is that the $1 \times 1$ convolutional layer can highlight the key local regions of an aerial scene.
Algorithm 2 (Bag Scheme Self-alignment Strategy). Input: instance representation set $\{I_t\}$, bag probability distribution $Y$, exact bag scheme $Y_c$. Output: loss function $L$ for optimization. The steps follow Eqs. 16-21: the difference distribution $Y_d$ is initialized to zero, accumulated over $t = 1 \to T$, and combined with the classification term into $L$.

3) Difference from base representation: The absolute difference between each other instance representation $I_t$ (here $t = 1, 2, \cdots, T$) and the base representation $I_0$ is calculated to depict the differences between the base representation and the instance representations of the other grains. This process can be presented as

$$I_{d,t} = \big|\, I_t - I_0 \,\big|, \quad t = 1, 2, \cdots, T, \qquad (16)$$

where $|\cdot|$ denotes the absolute difference and $I_{d,t}$ denotes the difference between the two instance representations at the corresponding grains.

4) Bag scheme alignment: By applying the MIL aggregation function $g$ to $I_{d,t}$, the bag probability $Y_{d,t}$, which depicts the difference between the instance representation of grain $t$ and the base, is generated. This process can be presented as

$$Y_{d,t} = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} I_{d,t}^{w,h}, \qquad (17)$$

where all the notations follow the paradigm of Eq. 14, that is, $1 \le w \le W$ and $1 \le h \le H$, with $W$ and $H$ denoting the width and height respectively. The overall bag scheme probability difference $Y_d$ between the base instance representation $I_0$ and the other instance representations $I_t$ (where $t = 1, 2, \cdots, T$) is then calculated as

$$Y_d = \mathrm{softmax}\Big(\sum_{t=1}^{T} Y_{d,t}\Big), \qquad (18)$$

where $\mathrm{softmax}$ denotes the softmax function. By minimizing the overall bag scheme probability differences $Y_d$, the bag prediction from each grain tends to be aligned to the same category. Technically, this minimization process is realized by our loss function in the next subsection.
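A minimal sketch of this alignment computation (again continuing the imports and assumptions of the earlier sketches):

```python
def self_aligned_fusion(instance_reps):
    """Sketch of the bag scheme self-alignment: the absolute difference between
    each grain's instance representation and the base I_0 (Eq. 16) is pooled
    into a per-grain difference prediction (Eq. 17); their softmax-normalized
    sum gives the overall difference distribution Y_d (Eq. 18)."""
    base = instance_reps[0]                                              # I_0
    diff_preds = []
    for inst in instance_reps[1:]:                                       # I_1..I_T
        i_diff = tf.abs(inst - base)                                     # I_{d,t}
        diff_preds.append(layers.GlobalAveragePooling2D()(i_diff))       # Y_{d,t}
    return tf.nn.softmax(tf.add_n(diff_preds))                           # Y_d
```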
F. Loss function

1) Cross-entropy loss function: Following the above notation, still assume $Y$ is the predicted bag probability distribution (Eq. 15), $Y_c$ is the exact bag category and there are $C$ categories in total. Then, the classic cross-entropy loss serves as the classification loss $L_{cls}$, presented as

$$L_{cls} = -\sum_{c=1}^{C} \mathbb{1}\,(c = Y_c)\, \log Y^{(c)}, \qquad (19)$$

where $Y^{(c)}$ denotes the predicted probability of category $c$ and $\mathbb{1}(\cdot)$ is the indicator function.

2) Semantic-aligning loss function: The formulation of the classic cross-entropy loss is also adapted to minimize the overall bag probability differences $Y_d$ in Eq. 18, yielding the semantic-aligning loss term $L_{sealig}$ (Eq. 20).

3) Overall loss: The overall loss function $L$ used to optimize the entire framework combines the above two terms $L_{cls}$ and $L_{sealig}$, calculated as

$$L = L_{cls} + \alpha L_{sealig}, \qquad (21)$$

where $\alpha$ is the hyper-parameter balancing the impact of the two terms. Empirically, we set $\alpha = 5 \times 10^{-4}$.
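Following Eq. 21, the two terms can be combined as below; the semantic-aligning term $L_{sealig}$ is assumed here to be computed separately according to Eq. 20:

```python
def agos_loss(y_true_onehot, bag_prob, l_sealig, alpha=5e-4):
    """Overall loss (Eq. 21): classification cross-entropy on the bag
    prediction (Eq. 19) plus the semantic-aligning term, weighted by alpha."""
    l_cls = tf.keras.losses.categorical_crossentropy(y_true_onehot, bag_prob)
    return l_cls + alpha * l_sealig
```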
The pseudo code of our proposed overall bag scheme self-alignment strategy is provided in Algorithm 2, which covers the content of subsections III-E and III-F.

IV. EXPERIMENT AND ANALYSIS
A. Datasets

1) UC Merced Land Use Dataset (UCM): To date, it is the most commonly used aerial scene classification dataset. It has 2,100 samples in total, with 100 samples for each of the 21 scene categories [79]. All samples have a size of 256×256 pixels with a 0.3-meter spatial resolution. Moreover, all samples are captured from aircraft, and both the illumination conditions and the viewpoints of these aerial scenes are quite similar.
2) Aerial Image Dataset (AID): It is a typical large-scale aerial scene classification benchmark with an image size of 600×600 [17]. It has 30 scene categories with a total of 10,000 samples. The sample number per class varies from 220 to 420. As the imaging sensors used to photograph the scenes in the AID benchmark are more varied, the illumination conditions and viewpoints are also more varied. Moreover, the spatial resolution of these samples varies from 0.5 to 8 meters.
3) Northwestern Polytechnical University (NWPU) dataset: This benchmark is more challenging than the UCM and AID benchmarks as the spatial resolution of samples varies from 0.2 to 30 meters [80]. It has 45 scene categories and 700 samples per class. All the samples have a fixed image size of 256 × 256. Moreover, the imaging sensors and imaging conditions are more varied and complicated than AID.

B. Evaluation protocols
Following the existing experimental protocols [17], [80], we report the overall accuracy (OA) in the format of 'average ± deviation' over ten independent runs on all three benchmarks. Table II summarizes the data partition and evaluation protocols of the three aerial scene classification benchmarks following [17], [80], where 'runs' denotes the number of independent repetitions required to report the classification accuracy. Experiments on the UCM, AID and NWPU datasets all follow the corresponding training-ratio settings: for UCM the training proportions are 50% and 80%, for AID they are 20% and 50%, and for NWPU they are 10% and 20%.

C. Experimental Setup
Parameter settings: In our AGOS, $C_1$ is set to 256, indicating that each dilated convolutional filter has 256 channels. Moreover, $T$ is set to 3, which means there are 4 branches in our AGOS module. Finally, $C$ is set to 21, 30 and 45 when training on the UCM, AID and NWPU benchmarks respectively, which equals the total number of scene categories in each benchmark.
Model initialization: All backbones, including ResNet-50, ResNet-101 and DenseNet-121, are initialized with parameters pre-trained on ImageNet. For the rest of our AGOS framework, the weight parameters are randomly initialized with a standard deviation of 0.001, and all bias parameters are initialized to zero.
Training procedure: The model is optimized by the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size is set to 32. The initial learning rate is set to 0.0001 and is decayed by a factor of 0.5 every 30 epochs until 120 epochs are finished. To mitigate potential over-fitting, $L_2$ regularization with a weight of $5 \times 10^{-4}$ and a dropout rate of 0.2 are used in all experiments.
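For reference, the reported settings roughly correspond to the following Keras configuration (a sketch of the stated hyper-parameters, not the released training script):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)

def lr_schedule(epoch, lr):
    # decay the learning rate by a factor of 0.5 every 30 epochs (120 epochs in total)
    return lr * 0.5 if epoch > 0 and epoch % 30 == 0 else lr

callbacks = [tf.keras.callbacks.LearningRateScheduler(lr_schedule)]
# batch size 32, dropout 0.2, and L2 weight regularization of 5e-4, as stated above
regularizer = tf.keras.regularizers.l2(5e-4)
```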
Other implementation details: Our experiments were conducted with the TensorFlow deep learning framework using the Python programming language. All experiments were run on a workstation with 64 GB RAM and an Intel i7-10700 CPU, with two RTX 2080 SUPER GPUs used for acceleration. Our source code is available at https://github.com/BiQiWHU/AGOS.
D. Comparison with the state-of-the-art methods

1) Results and comparison on UCM: Per-category classification accuracy (with the ResNet-50 backbone) under the 50% and 80% training ratios is displayed in Fig. 7 (a) and (b) respectively. It can be observed that almost all the samples in UCM are correctly classified. Notably, the hard-to-distinguish scene categories such as dense residential, medium residential and sparse residential are all identified correctly.
The potential explanations are summarized as follows.
(1) Compared with ground images, aerial images are usually large-scale. Thus, highlighting the key local regions related to the scene scheme is vital. The strongest-performing approaches, both the CNN based methods [2], [18], [25], [28], [81] and our AGOS, take advantage of such strategies. (2) Another important aspect of aerial scene classification is that the sizes of the key objects in aerial scenes vary greatly. Hence, many competitive approaches utilize multi-scale feature representation [18], [20], [21], [45]. Our AGOS also takes advantage of this and contains a multi-grain perception module. More importantly, our AGOS further allows the instance representation from each grain to focus on the same scene scheme, which improves the performance. (3) Generally speaking, the performance of the auto-encoder [52], [53] and GAN [54], [55] based solutions is not satisfactory, which may be explained by the lack of the above capabilities, namely highlighting the key local regions and multi-grain representation.
2) Results and comparison on AID: In Table IV, the results of our AGOS and other state-of-the-art approaches on AID are listed. Several observations can be made.
Per-category classification accuracy under the 20% and 50% training ratios is shown in Fig. 7 (c) and (d) respectively. It can be seen that most scene categories are well distinguished, and the categories that are difficult to classify, i.e., dense residential, medium residential and sparse residential, are also classified well by our solution. Possible explanations include: (1) The sample size in AID is generally larger than in UCM, and the key objects determining the scene category are more varied in size. As our AGOS can highlight the key local regions via MIL and build a more discriminative multi-grain representation than existing multi-scale aerial scene classification methods [18], [20], [21], [45], it achieves the strongest performance. (2) Highlighting the key local regions is also quite important to enhance the aerial scene representation capability of deep learning frameworks [2], [25], [28], [81], and this can also be one of the major reasons for the weak performance of the GAN based methods [54], [55]. (3) As there are many more training samples in AID than in UCM, the gap in representation capability between traditional hand-crafted features and deep learning based approaches becomes more obvious. In fact, this is a good example to illustrate that traditional hand-crafted feature based methods are far from sufficient to depict the complexity of aerial scenes.

(Figure caption: DMSMIL, shown with orange bars, denotes the performance of our initial version [32]; AGOS, shown with red bars, denotes the performance of the current version.)
3) Results and comparison on NWPU: Table V lists the per-category classification results of our AGOS and other state-of-the-art approaches on the NWPU benchmark. Several observations similar to those on AID can be made.
(1) Our AGOS outperforms all the compared state-of-the-art methods under both the 10% and 20% training ratios. Its DenseNet-121 and ResNet-101 versions achieve the best and second-best results in both settings, while the performance of the ResNet-50 version is competitive. (2) Generally speaking, approaches that highlight the key local regions of an aerial scene [2], [25], [28], [81], [82] or build a multi-scale convolutional feature representation [18], [20], [45] tend to achieve better performance. (3) The performance of the GAN based approaches [54], [55] degrades significantly compared with other CNN based methods on NWPU. Specifically, they are weaker than some CNN baselines such as VGGNet and GoogLeNet.
Moreover, the per-category classification accuracy under the 10% and 20% training ratios is shown in Fig. 7 (e) and (f). Most categories of the NWPU dataset are classified well. Similar to the discussion on AID, potential explanations include: (1) Spatial resolution and object size vary more in NWPU than in AID and UCM. Thus, both highlighting the key local regions and building a more discriminative multi-grain representation are critical for an approach to distinguish aerial scenes of different categories. The weak performance of the GAN based methods can also be attributed to the fact that neither of these two strategies is employed, which is an interesting direction to explore in the future. (2) As our AGOS builds multi-grain representations and highlights the key local regions, it is capable of distinguishing scene categories that vary greatly in object size and spatial density. Thus, the experiments on all three benchmarks reflect that our AGOS is qualified to distinguish such scene categories.

E. Ablation studies
Apart from the ResNet-50 baseline, our AGOS framework consists of a multi-grain perception (MGP) module, a multi-branch multi-instance representation (MBMIR) module and a self-aligned semantic fusion (SSF) module. To evaluate the influence of each component on the classification performance, we conduct an ablation study on the AID benchmark; the results are reported in Table VI. (1) The ResNet-50 baseline alone, with only its standard classification layer, does not perform satisfactorily; thus, more powerful representation learning strategies are needed for aerial scenes. (2) Our MBMIR module leads to a performance gain of 4.17% and 3.22% respectively. Its effectiveness can be explained by: 1) highlighting the key local regions in aerial scenes using the classic MIL formulation; 2) building a more discriminative multi-grain representation by extending MIL to the multi-grain form. (3) Our SSF module improves the performance by about 1% in both cases. This indicates that our bag scheme self-alignment strategy is effective in further refining the multi-grain representation so that the representation of each grain focuses on the same bag scheme. To sum up, MGP serves as the basis of our AGOS for perceiving the multi-grain feature representation; MBMIR is the key component of our AGOS, which places the entire feature representation learning under the MIL formulation and brings the largest performance gain; and SSF further refines the instance representations from different grains, making the aerial scene representation more discriminative.
F. Generalization ability

1) On different backbones: Table VII lists the classification performance, parameter number and inference time of our AGOS framework when embedded into three commonly used backbones, namely VGGNet [12], ResNet [11] and Inception [13]. It can be seen that on all three backbones our AGOS framework leads to a significant performance gain while only increasing the parameter number and the inference time slightly. The marginal increase in parameter number is notable, as our AGOS removes the traditional fully connected layers in CNNs, which usually occupy a large number of parameters.
2) On classification task from other domains: Table VIII reports the performance of our AGOS framework on a medical image classification [86] and a texture classification [87] benchmark respectively. The dramatic performance gain compared with the baseline on both benchmarks indicates that our AGOS has great generalization capability on other image recognition domains.

G. Discussion on bag scheme alignment
Generally speaking, the motivation of our self-aligned semantic fusion (SSF) module is to learn a discriminative aerial scene representation from multi-grain instance-level representations. However, in classic machine learning and statistical data processing, there are also some solutions that either select or fit an optimal outcome from multiple representations. Hence, it would be quite interesting to compare the impact of our SSF and these classic solutions.
To this end, four classic schemes for combining the bag probability distributions from the multi-grain instance representations, namely naive mean (Mean), naive max (Max) selection, majority vote (MV) and the least squares method (LS), are tested and compared on the AID dataset under the 50% training ratio. It can be seen that our SSF achieves the best performance, while: 1) max selection shows an apparent performance decline; 2) the other three solutions, namely the mean operation, majority vote and least squares, do not show much performance difference.
To better understand how these methods influence the scene scheme alignment, Fig. 9 offers the visualized co-variance matrices of the bag probability distributions from all the test samples. Generally speaking, a good scene representation has high responses in the diagonal region while the responses elsewhere should be as low as possible. It is clearly seen that our SSF has the best discrimination capability, while for the other solutions confusion between the bag probability distributions of different categories often occurs.
The explanation may lie in the following aspects: 1) our SSF aligns the scene scheme from both representation learning and loss optimization, and thus leads to a larger performance gain; 2) a naive average over the multi-grain instance representations already achieves an acceptable scene scheme representation, leaving little room for solutions such as least squares and majority vote to improve; 3) max selection may introduce more variance into the bag probability prediction, and thus its performance declines.

Fig. 9: Visualized co-variance matrices of the bag probability distribution after scene scheme alignment, processed by mean selection (a), max selection (b), majority vote (c), the least squares method (d) and our AGOS (e). Ideally, the co-variance matrix of the bag probability distribution should have high responses in the diagonal region and no responses in other regions.

H. Parameter analysis

1) Influence of the grain number T: It can be seen that when there are about 3 or 4 grains, the classification accuracy reaches its peak. After that, the classification performance slightly declines. This implies that the combined utilization of convolutional features with dilation rates of 1, 3 and 5 is most discriminative in our AGOS. When there are too many grains, the receptive field becomes too large and the scene representation becomes less discriminative. Conversely, when there are too few grains, the representation cannot fully depict scenes in which the key objects vary greatly in size.
On the other hand, the visualized samples in Fig. 8 also reveal that when the dilation rate in our MGP is too small, the instance representation tends to focus on a small local region of an aerial scene. In contrast, when the dilation rate is too large, the instance representation activates too many local regions irrelevant to the scene scheme. This reflects the importance of our scene scheme self-alignment strategy, which helps the representations from different grains align to the same scene scheme and refines the activated key local regions. Note that details on the interpretation capability of these patches and the possibility of weakly-supervised localization can be found in [60].
2) Influence of hyper-parameter α: Fig. 11 shows the fluctuation of classification accuracy when the hyper-parameter α in our loss function changes. It can be seen that the performance of our AGOS is generally stable as α varies. However, when α is too large, the performance declines noticeably, whereas when it is too small the degradation is slight.
3) Influence of differential dilated convolution: Table X lists the classification performance when each component of the differential dilated convolution (DDC) in our MGP is or is not used. It can be seen that both the differential operation (D#DC) and the dilated convolution (DD#C) lead to an obvious performance gain for our AGOS. Generally, the gain from the dilated convolution is larger than that from the differential operation, as it enlarges the receptive field of the deep learning model and thus enhances the feature representation more significantly.

Table X: Comparison of our differential dilated convolution (DDC) with the cases of not using the differential operation (D#DC), not using the dilated convolution (DD#C), and using neither the differential operation nor the dilated convolution (C), on the AID benchmark with the ResNet-50 backbone; metric in %.

V. CONCLUSION
In this paper, we propose an all grains, one scheme (AGOS) framework for aerial scene classification. To the best of our knowledge, it is the first effort to extend classic MIL into a deep multi-grain MIL formulation. The effectiveness of our AGOS lies in three aspects: 1) the MIL formulation allows the framework to highlight the key local regions that determine the scene category; 2) the multi-grain multi-instance representation is more capable of depicting complicated aerial scenes; 3) the bag scheme self-alignment strategy allows the instance representation from each grain to focus on the same bag category. Experiments on three aerial scene classification datasets demonstrate the effectiveness of our AGOS and its generalization capability.
As our AGOS is capable of building discriminative scene representation and highlighting the key local regions precisely, our future work includes transferring our AGOS framework to other tasks such as object localization, detection and segmentation especially under the weakly-supervised scenarios.