Few-Shot Scene Classification With Multi-Attention DeepEMD Network in Remote Sensing

Recently, deep learning-based methods of scene classification have become increasingly mature in remote sensing. However, training an excellent deep learning model for remote sensing scene classification requires a large number of labeled samples, so scene classification with insufficient scene images remains a challenge. The deepEMD network is currently the most popular model for solving these tasks. Although the deepEMD network obtains impressive results on common few-shot benchmark datasets, it is insufficient for capturing discriminative feature information about the scene from global and local perspectives. For this reason, this paper proposes an efficient few-shot scene classification scheme for remote sensing that combines multiple attention mechanisms and an attention-reference mechanism with the deepEMD network. First, scene features are extracted by a backbone that incorporates a global attention module and a local attention module, which enables the backbone to capture discriminative information at both the global level and the local level. Second, the attention-reference mechanism generates the weights of elements in the earth mover's distance (EMD) formulation, which can effectively alleviate the effects of complex backgrounds and intra-class morphological differences. The experimental results on three popular remote sensing benchmark datasets, Aerial Image Dataset (AID), OPTIMAL-31, and UC Merced, show that the proposed scheme obtains state-of-the-art results in few-shot remote sensing scene classification.


I. INTRODUCTION
Scene classification is an increasingly popular research topic in the field of remote sensing image recognition. With the rapid development of remote sensing technology, scene classification has been widely applied in urban management, national defense, land resource management and natural disaster detection [1]-[4]. Compared with common scene classification, few-shot scene classification in remote sensing is conducted with very few available labeled samples, as in one-shot and five-shot learning, and aims to ease the dilemma faced by data-driven deep learning models [5], [6]. On the basis of the level of features used for scene classification, existing methods can typically be divided into those based on traditional handcrafted features and those based on deep learning features.
The associate editor coordinating the review of this manuscript and approving it for publication was Jinpeng Yu.
In traditional remote sensing scene classification, handcrafted features can be subdivided into three types based on the morphological differences of features: texture features, structural features, and spectral features [7]. In the current literature, most existing models that are based on traditional handcrafted features, such as vector of locally aggregated descriptors (VLAD), locality-constrained linear coding (LLC) and spatial pyramid matching (SPM) [8]-[10], extract features and pass them to a classifier, refining them so that they exhibit translation invariance, scale invariance and sparsity and thereby yield a more robust model. In most cases, manually designing feature descriptors is expensive and time-consuming, which prevents models from fully utilizing the abundant semantic knowledge in remote sensing images.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With the rapid rise of deep learning, the drawbacks of traditional machine learning methods have become more obvious. Recently, with the availability of massive amounts of training data and high-performance graphics processing units (GPUs) [11], training deep neural networks to obtain deep learning features has become increasingly popular [12]-[18]. Methods that rest on deep learning features employ deep neural networks with excellent representation learning performance, such as the variational autoencoder (VAE) [19], [20], long short-term memory (LSTM) [21] and the convolutional neural network (CNN) [22], to obtain scene features. In previous work, an off-the-shelf CNN, such as VGG16 [23], ResNet [24], DenseNet [25] or Inception-V4 [26], was typically utilized to extract features from an image, and a classifier was then applied to achieve classification. In contrast, recent research has paid more attention to the amount of information contained in the extracted features. For this reason, features from different layers of various CNNs are integrated into a fused feature to improve model performance. For instance, the features of two fully connected layers of VGG-Net [23] are fused by discriminant correlation analysis [27]. To obtain a series of feature maps and filters for the input image, a weighted deconvolution network is trained by minimizing the reconstruction error between the input sample and the reconstructed sample [28]. In [15], a feature fusion method is employed to fuse the features of the convolutional layer and the fully connected layer in VGG-Net and CaffeNet, respectively; the fused features from the two networks are then further merged by a linear combination. The prerequisite for these models to show excellent performance is the existence of massive amounts of labeled data.
These models are data-driven because they train a deep neural network from scratch for each new task, using a large amount of labeled data and extensive parameter updates with an optimization algorithm such as Adam, Adagrad, or SGD. Hence, if a new remote sensing scene task has only a few labeled samples and lacks similar datasets, deep neural network models with a large number of parameters will easily overfit. Therefore, fast adaptation is an immense challenge when labeled samples are extremely scarce and the differences among datasets are large. For example, classic deep neural network models, such as AlexNet [12], ResNet [24] and DenseNet [25], can obtain up to 90% accuracy on UC Merced, OPTIMAL-31, Aerial Image Dataset (AID) and other datasets [29]-[31], but less than 30% accuracy when only one labeled sample is given.
Consequently, to solve few-shot scene classification problems in remote sensing, it is desirable to improve the representation learning ability of the model. Currently, most few-shot learning tasks are solved by meta-learning methods. Whereas conventional machine learning operates on data, meta-learning is conducted at the task level [32]-[34], which enables the model to learn more general meta-knowledge. Meta-learning has been extensively utilized in various tasks, such as regression [35], [36], classification [37], [38], and reinforcement learning [39], [40]. Meta-learning aims to learn how to learn from a series of tasks; to simulate the few-shot situation, each task is composed of a training set and a testing set, which makes meta-learning methods well suited to the few-shot learning problem. In the field of remote sensing, manually collected scene images differ considerably from natural images. For instance, scenes within the same class vary greatly, while scenes from different classes can look similar, owing to differences in shape, background and lighting [22], [41]. Hence, how to measure the degree of similarity among tasks has become another critical challenge, that is, the challenge of selecting tasks with large inter-class differences and small intra-class differences for training. To address these challenges, the deepEMD network [42], an attention mechanism and a reference mechanism are introduced into our proposed scheme for scene classification in remote sensing. The deepEMD network is adopted to counter the overfitting of deep neural network models when labeled samples are extremely scarce. The attention mechanism helps the proposed model capture salient features. In addition, the reference mechanism is applied to yield the weights of elements for calculating the earth mover's distance (EMD). Our proposed architecture is shown in Fig. 1. First, a carefully selected backbone is employed to generate the embedding of the scene image.
Second, to generate discriminative image embeddings, a multiple attention module that contains global attention and local attention is adopted in the proposed architecture. The global attention module extracts salient features from the global perspective, that is, at the feature map level, to improve the model's robustness to noise such as messy backgrounds and morphological differences. Local attention focuses on important subtle local features and improves representation learning by enhancing discriminative local features and suppressing the others, which can alleviate the problems of small inter-class differences and large intra-class differences. With the multiple attention module, which makes the model pay more attention to informative features, our proposed model achieves better and more efficient performance. Last, classification is conducted for a query image embedding by searching for the nearest category embedding. The EMD is employed to measure the distance between two images, and the weights of the elements in the EMD formulation are calculated by the attention-reference mechanism. The remote sensing scene classification model can be trained on different scene data to recognize new, unseen scene images. Our proposed scheme is evaluated on three public datasets, and the experimental results show that our proposed multi-attention deepEMD network (MAEMD-Net) outperforms the existing methods and obtains state-of-the-art performance in few-shot scene classification.
Our main contributions in this work are as follows:
1) A novel framework, called MAEMD-Net, is proposed to solve few-shot scene classification in remote sensing. MAEMD-Net constructs an embedding space in which the distance between samples of the same category is small and the distance between samples of different categories is large.
2) A global attention mechanism and a local attention mechanism are proposed to improve the representation learning ability of the model, and an attention-reference mechanism is proposed to yield the weights of elements in the EMD formulation. Ablation analyses of the different mechanisms are conducted to illustrate how each mechanism contributes to the performance of the proposed model.
3) Experiments on three popular remote sensing scene datasets, UC Merced [29], OPTIMAL-31 [30] and AID [31], show that our method notably outperforms existing state-of-the-art methods on few-shot tasks.
The remainder of the paper is structured as follows: The proposed framework and algorithms are introduced in Section II. Section III shows the experimental results and discussion. The conclusion is presented in Section IV.

II. THE PROPOSED SCHEME
This section elaborates the overall architecture of the proposed MAEMD-Net. The training process of the proposed model is divided into two steps, namely, pre-training and meta-training. The entire model includes three main parts: the backbone network, the attention modules, and the attention-reference mechanism.

A. PRE-TRAINING
In pre-training, multiple iterations are performed on a task until convergence. Algorithm 1 summarizes the process of pre-training, including sampling and updating of parameters ω, which is the same as the common training process. The goal of pre-training is to allow the model to learn a set of parameters to fit samples. Accordingly, pre-training allows the model to learn a set of general parameters ω on a task, which is convenient for fine-tuning in the meta-training stage.

Algorithm 1 Model Pre-Training
Input: p(D): distribution over data points, α: step size hyperparameter
Output: base learner
1: randomly initialize ω;
2: repeat
3:   sample D ∼ p(D);
4:   calculate grad = ∇L(f_ω, D);
5:   update parameters ω ← ω − α * grad;
6: until convergence;

Algorithm 2 Model Meta-Training
Input: p(T): distribution over tasks, β: step size hyperparameter
Output: meta learner
1: randomly initialize η;
2: repeat
3:   sample batch of tasks τ ∼ p(T);
4:   for each τ_i ∈ τ do
5:     sample {D_support, D_query} ∼ p(τ_i);
6:     calculate embeddings of D_support and D_query: E_support, E_query;
7:     get weight_support, weight_query using E_support, E_query by attention-reference mechanism;
8:     calculate flow_τi = EMD(weight_support, weight_query, cost_matrix);
9:     calculate test loss L_τi(flow_τi, cost_matrix, D_query);
10:    evaluate grad = ∇_η L_τi(flow_τi, cost_matrix, D_query);
11:    update parameters η ← η − β * grad;
12:  end for;
13: until convergence;
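The loop of Algorithm 1 can be illustrated with a minimal plain-Python sketch. The toy model f_w(x) = w * x, the squared-error loss, the fixed step budget in place of a convergence test, and the function name `pretrain` are all assumptions made for illustration; they are not the network or loss used in this paper.

```python
import random

def pretrain(data, alpha=0.5, steps=200, batch_size=4, seed=0):
    """SGD loop mirroring Algorithm 1 for a toy model f_w(x) = w * x."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)                 # 1: randomly initialize the parameter
    for _ in range(steps):                     # 2: repeat (fixed budget, an assumption)
        batch = rng.sample(data, batch_size)   # 3: sample D ~ p(D)
        # 4: gradient of the mean squared error with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w -= alpha * grad                      # 5: w <- w - alpha * grad
    return w

# Toy data generated by y = 2x, so the learned parameter should approach 2.
toy_data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w_hat = pretrain(toy_data)
```

Algorithm 2 follows the same update pattern, with the batch replaced by a task and the loss computed from the EMD flow on the query set.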

B. META TRAINING
Compared with pre-training, the purpose of meta-training is to allow the model to acquire the ability to learn to learn. With this ability, the model can rapidly adapt to various tasks τ with a small number of samples and episodes. The proposed scheme lifts the training data from samples D to tasks τ. Each task τ_i, typically referred to as an episode, includes a support set D_support and a query set D_query. In an episode, N categories (N-way) with K samples each (K-shot) are drawn from the training set as the support set D_support, and the remaining samples of the N categories are utilized as the query set D_query. As shown in Algorithm 2, meta-learning trains on a large number of tasks by repeatedly sampling tasks and updating the weights η, i.e., the meta-level update. Generally, training at the task level gives the model more robust generalization performance and also simulates few-shot scenarios.
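Episode construction can be sketched as follows. This is a minimal illustration, assuming the dataset is available as a mapping from class label to a list of image identifiers; the function name `sample_episode`, the toy dataset, and the fixed query size per class (15, as in the experimental protocol of Section III) are illustrative choices.

```python
import random

def sample_episode(labels_to_items, n_way=5, k_shot=1, n_query=15, seed=None):
    """Draw one N-way K-shot episode: a support set and a disjoint query set."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap.
        items = rng.sample(labels_to_items[label], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query

# Hypothetical dataset: 10 classes with 100 image ids each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(100)] for c in range(10)}
support_set, query_set = sample_episode(dataset, seed=0)
```

With the default arguments, the support set holds 5 labeled images (one per class) and the query set holds 75 images.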

C. BACKBONE NETWORK
Given a remote sensing scene image, the model needs to be able to retrieve scene images with the same category as the query picture from the existing data set, which is typically referred to as image retrieval. Specifically, the query image is paired with the image database, and then the corresponding category of the query image can be obtained. Similarly, metric-based few-shot learning also requires image similarity measurement, whose premise is to learn an effective embedding space. In this embedding space, the input image is mapped into an embedding vector, so that the distance metric can be used to measure the level of similarity between images.
The CNN has recently achieved remarkable results in various computer vision challenges, which confirms its effectiveness. A CNN typically consists of a series of layers, such as convolutional layers, nonlinear activation layers, pooling layers and fully connected layers. By carefully designing these layers, the CNN can extract desirable hierarchical features and obtain excellent results. A CNN typically has a large number of trainable parameters, so training an excellent CNN usually requires a large amount of data. Compared with traditional hand-designed feature descriptors, such as SIFT and HOG, the CNN not only extracts informative features but also avoids much of the cost of manually designing feature descriptors. Many well-designed network models with excellent performance exist, such as VGG, ResNet, DenseNet, SENet, InceptionNet and SqueezeNet. In this paper, ResNet-18 is selected as the backbone for extracting remote sensing scene image features. The global average pooling operation and the fully connected layer at the end are removed, so that embedding feature maps with significant discrimination can be obtained.

D. MULTIPLE ATTENTION
Multiple attention consists of two components: a global attention mechanism and a local attention mechanism. From a global perspective, the discriminative information between different feature maps of the input features can be mined by the global attention mechanism, so that category features can be extracted efficiently. In contrast, the local attention mechanism focuses on the salient information of the input features at different spatial positions, which improves the representation learning ability of the model. The schematic of the multiple attention module is shown in Fig. 2. To fully utilize the computing resources of the device, the global attention module and the local attention module are applied in parallel, each by element-wise multiplication with the input features, and the two resulting features are then summed element-wise to obtain discriminative features.

1) GLOBAL ATTENTION MECHANISM
In the process from an input image to a feature embedding, finding a suitable embedding space plays a vital role in the final model performance. In a suitable embedding space, scene images that belong to the same category are located near each other, and scene images of different categories are separated from each other. However, the feature embedding obtained in practice generally deviates substantially from the expected feature embedding when the samples contain noise interference, such as background differences and morphological differences. This interference directly affects the embedding representations of different categories in the embedding space, which may cause the model to misjudge the input scene. Consequently, a global attention mechanism is proposed to select representative feature maps from all feature maps and improve the model's feature extraction capability. A schematic of the global attention module is shown in Fig. 3.
The scene feature X = [x_1, ..., x_k] can be extracted by the backbone, where k represents the number of feature maps, and each feature map is represented by x_i ∈ R^z. Assume that the same type of scene image is represented by H = [X_1, ..., X_n], where n refers to the number of images that belong to the same category.
The input feature x is first passed through a convolution, denoted f_{3×3}(·), with a filter size of 3×3, which further refines the features. The number of convolution kernels is kept the same as in the input. The non-saturating activation function Leaky ReLU is adopted to enhance the nonlinear representation ability of the model. This operation is performed twice in parallel to obtain two outputs. Global max pooling and global average pooling are then applied to capture global information from two perspectives, yielding two global feature representations, x_GMP and x_GAP, which refer to the global max pooling feature and the global average pooling feature, respectively. Subsequently, the two global feature representations are sent to the shared channel transformer to generate global attention features. The shared channel transformer is a multilayer perceptron (MLP) comprising an input layer, a single hidden layer and an output layer. Element-wise summation and a nonlinear transformation are then performed to integrate the output attention features. In this transformation, σ refers to the sigmoid function, W_1 and W_2 represent the weight parameters of the hidden layer and the output layer of the MLP, respectively, and b_1 and b_2 are the corresponding bias parameters. The output features with global attention, which capture the vital image features from the global perspective, are obtained by element-wise multiplication of the global attention and the input feature X. Considering the overall situation, the more important feature maps, which play a vital role in the performance of the final model, are given more attention. Correspondingly, the feature maps that interfere with the final classification result are suppressed. In this case, the performance of the model is also improved.
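The pooling, shared MLP and reweighting steps described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the two parallel 3×3 convolutions are omitted, the hidden layer uses Leaky ReLU (the hidden activation is not specified in the text), and the helper names and toy weights are hypothetical.

```python
import math

def leaky_relu(v, slope=0.01):
    return v if v > 0 else slope * v

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mlp(vec, w1, b1, w2, b2):
    """Shared channel transformer: one hidden layer, then an output layer."""
    hidden = [leaky_relu(sum(w * v for w, v in zip(row, vec)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def global_attention(feature_maps, w1, b1, w2, b2):
    """feature_maps: list of k maps, each a flat list of z activations."""
    x_gmp = [max(fm) for fm in feature_maps]            # global max pooling
    x_gap = [sum(fm) / len(fm) for fm in feature_maps]  # global average pooling
    summed = [a + b for a, b in zip(mlp(x_gmp, w1, b1, w2, b2),
                                    mlp(x_gap, w1, b1, w2, b2))]
    gamma_g = [sigmoid(s) for s in summed]              # one weight per feature map
    attended = [[g * v for v in fm] for g, fm in zip(gamma_g, feature_maps)]
    return gamma_g, attended

# Toy input: k = 2 feature maps with z = 4 activations, hidden size 1.
maps = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
w1, b1 = [[1.0, -1.0]], [0.0]
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
gamma_g, attended = global_attention(maps, w1, b1, w2, b2)
```

In this toy case the first feature map receives a weight close to 1 and the second a weight close to 0, which is the kind of emphasis and suppression the module is meant to provide.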

2) LOCAL ATTENTION MECHANISM
Generally, most of the existing excellent neural network models are data-driven, that is, the expected class features can only be learned with the support of massive amounts of data. Few-shot learning, in contrast, refers to training models with only a few training samples. In this setting, such models are typically restricted and may even perform worse than traditional methods. Therefore, how to make the model learn a strong representation from few labeled samples is a key issue. One of the most common approaches is to introduce an attention mechanism into the model to alleviate the problems caused by insufficient training samples. In a local area, some features may contribute more to the final classification result than the surrounding features. These discriminative local features are captured by the local attention mechanism. A schematic of the local attention module is shown in Fig. 4.
First, local response normalization (LRN), a normalization technique used in deep learning to improve accuracy, is performed to strengthen discriminative features within a local area while suppressing neighboring features. Specifically, the LRN creates a competitive mechanism among the activities of local neurons: values with larger responses become relatively larger, and neurons with smaller responses are inhibited, which enhances the generalization ability of the model. In the expression for the output feature z^i_{x,y}, h^i_{x,y} represents the activation value after applying the i-th filter at position (x, y), the sum runs over m adjacent filters at the same spatial position, and k represents the total number of filters in the layer. The constants η, α, m and β are hyperparameters whose values can be adjusted according to the performance of the model on the validation set.
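As an illustration, the sketch below implements the widely used cross-channel LRN form, z^i = h^i / (η + α Σ_j (h^j)^2)^β, with the sum taken over the m filters adjacent to filter i. It is assumed here that the normalization in this paper follows that standard form; the default hyperparameter values are placeholders, not the values used in the experiments.

```python
def local_response_norm(h, m=5, eta=2.0, alpha=1e-4, beta=0.75):
    """Normalize the k channel activations h[0..k-1] at one spatial position."""
    k = len(h)
    out = []
    for i in range(k):
        lo = max(0, i - m // 2)        # first neighboring filter in the window
        hi = min(k - 1, i + m // 2)    # last neighboring filter in the window
        energy = sum(h[j] ** 2 for j in range(lo, hi + 1))
        out.append(h[i] / (eta + alpha * energy) ** beta)
    return out

# Small example with easy-to-check constants (eta = alpha = beta = 1, m = 3).
normalized = local_response_norm([1.0, 2.0, 3.0], m=3, eta=1.0, alpha=1.0, beta=1.0)
```

Each activation is divided by a term that grows with the squared responses of its neighbors, so a channel surrounded by strong responses is damped more than one surrounded by weak responses.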
Second, max-pooling and average-pooling operations are performed along the channel dimension on the new local response features z^i_{x,y}, and the output feature maps are concatenated along the channel dimension. A convolutional layer with a 7×7 filter is then used to integrate the concatenated feature maps into a local attention feature map γ_l ∈ R^{W×H}, where W and H denote the width and height of the output feature map, which are consistent with the width and height of the input image. To further refine the resulting local attention feature map, a convolutional layer with a 3×3 filter is added, with the same number of filters as the previous layer. In these operations, W_3, b_3, W_4 and b_4 are learnable parameters, and σ refers to the sigmoid function. Last, the local attention is combined with the input feature X by element-wise multiplication to obtain the locally attended features ζ.
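These steps can be sketched in plain Python as follows. Several points are assumptions made for illustration: the 7×7 layer is taken to use (W_3, b_3) and the 3×3 layer (W_4, b_4), no activation is placed between the two convolutions, the sigmoid is applied once at the end, and zero padding keeps the spatial size. The helper `conv2d_same` is a hypothetical name.

```python
import math

def conv2d_same(planes, kernels, bias):
    """Single-output 2-D convolution with zero padding (output keeps H x W).
    planes: list of C maps (H x W); kernels: list of C square kernels."""
    height, width = len(planes[0]), len(planes[0][0])
    size = len(kernels[0])
    pad = size // 2
    out = [[bias] * width for _ in range(height)]
    for plane, kernel in zip(planes, kernels):
        for y in range(height):
            for x in range(width):
                for dy in range(size):
                    for dx in range(size):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:
                            out[y][x] += kernel[dy][dx] * plane[yy][xx]
    return out

def local_attention(z, x, k7, b3, k3, b4):
    """z: LRN features, x: input features, both lists of C maps (H x W)."""
    height, width = len(z[0]), len(z[0][0])
    # Pool along the channel dimension at every spatial position.
    max_map = [[max(ch[r][c] for ch in z) for c in range(width)] for r in range(height)]
    avg_map = [[sum(ch[r][c] for ch in z) / len(z) for c in range(width)] for r in range(height)]
    fused = conv2d_same([max_map, avg_map], k7, b3)   # 7x7 conv over the two pooled maps
    refined = conv2d_same([fused], k3, b4)            # 3x3 refinement
    gamma_l = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in refined]
    zeta = [[[gamma_l[r][c] * ch[r][c] for c in range(width)] for r in range(height)] for ch in x]
    return gamma_l, zeta

# Toy check with identity-like kernels (1 at the center, 0 elsewhere).
center7 = [[1.0 if (r, c) == (3, 3) else 0.0 for c in range(7)] for r in range(7)]
center3 = [[1.0 if (r, c) == (1, 1) else 0.0 for c in range(3)] for r in range(3)]
feats = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 0.0]]]
gamma_l, zeta = local_attention(feats, feats, [center7, center7], 0.0, [center3], 0.0)
```

The resulting map γ_l has one weight per spatial position, shared by all channels, in contrast to the per-channel weights of the global attention module.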

E. ATTENTION-REFERENCE MECHANISM
In the EMD formulation, the weight of each pixel plays a vital role in the distance calculation. To obtain accurate weights for the pixels of the two images, the attention-reference mechanism is proposed in eq. (9).
In eq. (9), X denotes the input feature, γ_g^A represents the global attention of input feature A, and γ_l^A represents the local attention of input feature A. To aid intuition, the attention-reference mechanism is visualized in Fig. 5. The module contains two sub-modules: a global attention module and a local attention module. Input feature B is referenced against the global and local attention of input feature A so that the weights can be obtained. Specifically, the global attention and local attention are first generated from input feature A. Input feature B then successively takes the dot product with them to obtain the weights of input feature B relative to input feature A. As mentioned above, the global and local attention modules together take more comprehensive information into account. Compared with other weight generation methods, our attention-reference mechanism can fully take into account the overall situation and resist the influence of complex backgrounds.
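One possible reading of this procedure is sketched below. It is an interpretation, not a reproduction of eq. (9): each location vector of B is dotted with the channel-wise global attention of A, the result is scaled by the local attention of A at that location, negative values are clipped to zero, and the weights are normalized to sum to one. The clipping, the normalization and the function name `reference_weights` are assumptions.

```python
def reference_weights(feature_b, gamma_g_a, gamma_l_a):
    """feature_b: n location vectors (each of length C) from image B.
    gamma_g_a: C channel attentions of image A.
    gamma_l_a: n location attentions of image A."""
    raw = []
    for vec, loc_att in zip(feature_b, gamma_l_a):
        channel_score = sum(v * g for v, g in zip(vec, gamma_g_a))  # dot with global attention
        raw.append(max(channel_score * loc_att, 0.0))               # scale by local attention
    total = sum(raw)
    # Normalizing to unit mass is an assumption; fall back to uniform weights if all are zero.
    return [r / total for r in raw] if total > 0 else [1.0 / len(raw)] * len(raw)

# Toy example: two locations, two channels.
weights_b = reference_weights([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25], [1.0, 1.0])
```

Locations of B that align with the channels and positions emphasized in A receive larger weights and therefore contribute more to the EMD.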

F. DeepEMD NETWORK
The main idea of the deepEMD network is to adopt the EMD to measure the level of similarity between the category features and the query embedding. The EMD was initially introduced to solve image retrieval problems. When the pairwise distances between all element pairs are given, the EMD yields the optimal matching flow between the two structures, that is, the flow with the least total cost. In our proposed method, the embedding of each category and the feature embedding of the query image are extracted via the backbone with the global attention mechanism and the local attention mechanism. The weights of the elements in the EMD formulation are generated by the attention-reference mechanism.
The deepEMD network learns an embedding space, where classification is achieved by calculating the distance between the query embedding and the embedding of each category. In this work, all EMD calculations are implemented via the OpenCV library. The EMD formulation is given in eq. (10).
In eq. (10), weight1 represents the weight of input feature B relative to input feature A, weight2 represents the weight of input feature A relative to input feature B, and cost_matrix represents the cost of transforming input feature A into input feature B.
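For reference, the EMD is commonly written as the balanced transportation problem below. This is the standard textbook form, offered only as a reading aid for eq. (10) and not as a reproduction of it. The symbols s_i, d_j, c_ij and x_ij are labels introduced here for the entries of weight1, the entries of weight2, the entries of cost_matrix and the matching flow, respectively, and the two weight vectors are assumed to have equal total mass.

```latex
\begin{aligned}
\min_{x_{ij}} \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, x_{ij} \\
\text{s.t.} \quad & x_{ij} \ge 0, \qquad i = 1, \dots, m,\; j = 1, \dots, n, \\
& \sum_{j=1}^{n} x_{ij} = s_i, \qquad i = 1, \dots, m, \\
& \sum_{i=1}^{m} x_{ij} = d_j, \qquad j = 1, \dots, n .
\end{aligned}
```

Under this form, the optimal flow x_ij indicates how much of element i in one image is matched to element j in the other, and the optimal objective value is the distance used for classification.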

III. EXPERIMENT AND ANALYSIS

A. DATASET DESCRIPTION AND EVALUATION METRIC
All experiments are based on three remote sensing benchmark datasets, i.e., UC Merced (21 categories, 2100 images) [29], OPTIMAL-31 (31 categories, 1860 images) [30] and AID (30 categories, 10000 images) [31]. Each dataset is randomly divided into two disjoint subsets according to the scene category: one subset serves as the base class set C_base for auxiliary training, and the other serves as the novel class set C_novel for evaluation. Details of the category split are presented in Table 1. For each category in C_base, the original split of these datasets is followed to divide the samples into training samples and validation samples. In our experiments, the former are employed to update the parameters, and the latter are employed to adjust the hyperparameters of the model. To simulate the few-shot scenario, in each training episode, the number of categories in the support set drawn from the base class set equals the number of categories in the support set drawn from the novel class set, i.e., C^s_base = C^s_novel. Specifically, for 5-way 1-shot learning, C^s is set to 5, and K^s is set to 1. In all settings, the number of query samples per category, K^q, is set to 15 to follow the protocol in [43]. Likewise, in the testing phase, each episode uses the same approach for the novel class set.
To accurately evaluate the performance of models, the average classification accuracy is adopted as the evaluation criterion. In its definition, S_t represents the number of samples that are correctly classified in the i-th task, and T represents the number of tasks, that is, the number of episodes. As one of the most common performance metrics, it averages the evaluation results over multiple episodes.
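A minimal sketch of this metric is given below, assuming it is computed as the mean of the per-episode accuracies (correctly classified query samples divided by the query set size). The function name and the example counts are illustrative only.

```python
def mean_episode_accuracy(correct_counts, query_sizes):
    """Average classification accuracy over T episodes."""
    per_episode = [c / q for c, q in zip(correct_counts, query_sizes)]
    return sum(per_episode) / len(per_episode)

# Three hypothetical 5-way episodes with 15 query samples per class (75 queries each).
accuracy = mean_episode_accuracy([60, 45, 75], [75, 75, 75])
```

For these counts the per-episode accuracies are 0.8, 0.6 and 1.0, giving an average accuracy of 0.8.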

B. EXPERIMENTAL SETTINGS
All models in this work are trained and evaluated with PyTorch on a machine with an AMD Ryzen 7 3700X 8-core processor at 3.60 GHz, 32.00 GB of RAM, and a Tesla P100 GPU with 16 GB of memory. The whole training process consists of two parts: pre-training and meta-training. The model is first pre-trained from scratch in the conventional way and then meta-trained to improve its generalization performance, which is similar to transfer learning. Leaky ReLU is selected as the activation function of the proposed model. The parameters of all convolutional layers are initialized with the Kaiming normal distribution. In both the pre-training stage and the meta-training stage, the stochastic gradient descent algorithm is employed as the optimizer of the proposed model, with the weight decay and momentum set to 0.0005 and 0.9, respectively. The initial value of the learning rate is 0.1. For a fair comparison, the proposed model and the other benchmark models are trained on the same training set and evaluated on the same test set. In addition, all experimental results are the average values obtained over 600 episodes, following the protocol in [43].
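To make the optimizer settings concrete, the sketch below applies one stochastic gradient descent step with momentum and weight decay to a list of scalar parameters. It assumes the convention in which the weight decay term is added to the gradient before the momentum buffer is updated, which is how PyTorch's SGD optimizer behaves with zero dampening; the function name is hypothetical.

```python
def sgd_momentum_step(params, grads, buffers, lr=0.1, momentum=0.9, weight_decay=0.0005):
    """One SGD update with the hyperparameters reported in this section."""
    new_params, new_buffers = [], []
    for p, g, buf in zip(params, grads, buffers):
        g = g + weight_decay * p     # L2 penalty folded into the gradient
        buf = momentum * buf + g     # momentum buffer accumulates gradients
        new_params.append(p - lr * buf)
        new_buffers.append(buf)
    return new_params, new_buffers

# One step on a single parameter with value 1.0, gradient 0.5 and an empty buffer.
params, buffers = sgd_momentum_step([1.0], [0.5], [0.0])
```

In this example the effective gradient is 0.5005, so the parameter moves from 1.0 to 0.94995.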

C. COMPARISON WITH BACKBONES
In most cases, classification accuracy is used to measure the performance of neural network models. When practical problems need to be solved, however, the time complexity and space complexity of the model must also be taken seriously, because they determine the cost of solving the problem. For this reason, floating-point operations (FLOPs) and the number of parameters (params) are typically used to quantify these two kinds of complexity. For instance, for the DenseNet series of networks, as the width and depth of the network increase, the classification accuracy also increases, but so do the hardware overhead and time overhead required for training. Too many parameters also make it harder for the model to obtain the best embedding representation of the input image, so the expected effect is difficult to achieve. When the classification performance of models is relatively close, smaller time complexity and space complexity are typically preferred. Deep neural network models with different architectures are compared on the UC Merced dataset, and the classification accuracy on the test set is tabulated in Table 2. Based on this comparison, ResNet-18 is adopted as the backbone of our proposed model. Although ResNet-18 does not have the highest classification accuracy, it has the best overall performance when time complexity and space complexity are considered.
Multiple attention is composed of two parts: global attention and local attention. Table 3 tabulates the experimental results with different attention mechanisms on the UC Merced dataset. When the model is combined with only the global attention mechanism or only the local attention mechanism, performance drops slightly, while combining both mechanisms achieves a noteworthy improvement of 4.84% over the baseline in the 5-way 10-shot setting. The reason may be that a single attention mechanism focuses too much on one perspective while disregarding the other. The combination of the global attention mechanism and the local attention mechanism (i.e., multiple attention) achieves the best results. This confirms that the two attention mechanisms interact and that both are indispensable, since removing either mechanism causes performance degradation. Moreover, Table 3 shows that the global attention mechanism plays the more vital role, because its removal leads to a greater degradation than that of the local attention mechanism.
To analyze the role of multiple attention more intuitively, the features extracted by the backbone without and with multiple attention are visualized through t-SNE, as shown in Fig. 6 and Fig. 7. It is observed that multiple attention makes samples of the same category cluster more tightly in the embedding space and pushes samples of different categories apart, which enables the classifier to achieve better performance when measuring the similarity between samples. Accordingly, this also supports the benefit of multiple attention.

E. COMPARISON WITH VARIOUS DISTANCE FUNCTIONS
To analyze how the distance metric used to compute the cost matrix over all element pairs in the EMD formulation affects model performance, five distance metrics are compared on the UC Merced dataset: Chebyshev, Manhattan, cosine, dot product and Euclidean. The experimental results with the different distance metrics are shown in Fig. 8. For a fair comparison, all experiments are conducted on the same test dataset. After convergence, the model with the cosine distance clearly outperforms the other models on the test set, exceeding the closest result by nearly 8%. This may be explained by the difference in the sparsity of the embedding space due to the lack of training samples.
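The five candidate metrics can be sketched as follows for a pair of local feature vectors, together with a helper that builds the pairwise cost matrix. This is a minimal pure-Python illustration; the function names are hypothetical, and negating the dot product to turn a similarity into a cost is an assumption rather than the setting used in the experiments.

```python
import math


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dot_cost(u, v):
    # Assumption: the dot-product similarity is negated to act as a cost.
    return -sum(a * b for a, b in zip(u, v))


def cosine_cost(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


METRICS = {
    "chebyshev": chebyshev,
    "manhattan": manhattan,
    "euclidean": euclidean,
    "dot": dot_cost,
    "cosine": cosine_cost,
}


def cost_matrix(support, query, metric):
    """Pairwise costs between every local feature of two images."""
    dist = METRICS[metric]
    return [[dist(u, v) for v in query] for u in support]


# Toy local features (two per image), chosen only for illustration.
support_feats = [[1.0, 0.0], [0.0, 2.0]]
query_feats = [[2.0, 0.0], [1.0, 1.0]]
for name in METRICS:
    print(name, cost_matrix(support_feats, query_feats, name))
```

In this toy case the cosine cost between `[1, 0]` and `[2, 0]` is zero while the Euclidean cost is one, which shows that the cosine metric ignores feature magnitude and compares direction only.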

F. COMPARISON WITH REFERENCE MECHANISM
Our proposed attention-reference mechanism is compared with the cross-reference mechanism, and Fig. 9 shows the experimental results. With the multiple attention module kept the same, the proposed MAEMD-Net with the attention-reference mechanism outperforms DeepEMD with the cross-reference mechanism by approximately 2% in the 5-way 1-shot setting of the pre-training stage on the UC Merced validation set. This suggests that the attention-reference mechanism provides more comprehensive and accurate reference coordinates while resisting the interference of complex backgrounds.

G. COMPARISON WITH MODELS
To further confirm the effectiveness of the proposed scheme, it is compared with the following five state-of-the-art models:

1) PROTOTYPICAL NETWORK
The prototypical network [43] attempts to find an embedding space in which the distance between samples of the same category is small, while the distance between samples of different categories is large. The Euclidean distance is used to measure the distance between samples in this embedding space, and the nearest-neighbor idea is applied to classify the test set.
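The classification rule described above can be sketched as follows: each class is represented by the mean of its support embeddings, and a query is assigned to the class with the nearest prototype under the Euclidean distance. The embeddings and class names are toy values assumed for illustration; in practice they would come from a learned embedding network.

```python
import math


def prototypes(support):
    """Mean embedding per class; `support` maps label -> list of vectors."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [
            sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)
        ]
    return protos


def classify(query, protos):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# Toy 2-way 2-shot support set with hypothetical 2-D embeddings.
support_set = {
    "river": [[0.0, 0.0], [0.0, 2.0]],
    "harbor": [[4.0, 4.0], [6.0, 4.0]],
}
protos = prototypes(support_set)
print(classify([1.0, 1.0], protos), classify([4.0, 3.0], protos))
```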

2) MAML
Model-Agnostic Meta-Learning (MAML) [36] aims to find a universal initialization, with which the model can converge to the optimum in a small number of iterations.
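The idea of learning an initialization can be illustrated with a deliberately tiny example. In the sketch below, each task is assumed to have the scalar loss (theta - t)^2 for its own target t, the inner loop takes one gradient step, and the outer loop differentiates through that step. The tasks, step sizes and function name are hypothetical, and the same loss is used for the inner and outer updates, which simplifies the usual support/query split.

```python
def maml_1d(targets, alpha=0.1, beta=0.05, steps=500):
    """Meta-learn a scalar initialization for tasks with loss (theta - t)^2."""
    theta = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for t in targets:
            # Inner loop: one gradient step on the task loss (theta - t)^2.
            adapted = theta - alpha * 2.0 * (theta - t)
            # Outer gradient of (adapted - t)^2 with respect to theta.
            # By the chain rule, d(adapted)/d(theta) = 1 - 2 * alpha.
            meta_grad += 2.0 * (adapted - t) * (1.0 - 2.0 * alpha)
        # Meta-update with the gradient averaged over tasks.
        theta -= beta * meta_grad / len(targets)
    return theta


# Three hypothetical tasks; the learned initialization settles between them.
theta_init = maml_1d([1.0, 3.0, 5.0])
print(theta_init)
```

For these symmetric quadratic tasks the learned initialization approaches the mean of the task targets, from which a single inner step moves the parameter equally far toward any task optimum.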

3) RN
Relation Network (RN) [44] consists of two modules: an embedding module and a relation module. The embedding module is responsible for learning the embedding representation of samples, while the relation module attempts to learn a metric for comparing these embedding representations. Compared with non-parametric distance measures, such as the Euclidean distance, this method obtains more intrinsic information between samples.

The method in [45] uses a meta-training approach to raise the level of learning from samples to tasks, and then learns a general metric space from a series of tasks to achieve classification in that metric space.

5) DeepEMD NETWORK
DeepEMD [42] uses the EMD to measure the similarity between two images, and a cross-reference mechanism is designed to generate the weights needed to calculate the EMD.
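For reference, the EMD-based similarity can be written as the transportation problem below. This is a sketch in the notation commonly used for DeepEMD, which is an assumption here: $u_i$ and $v_j$ denote local embedding vectors of the two images, $s_i$ and $d_j$ their weights, $c_{ij}$ the cost of matching a pair, and $x_{ij}$ the flow between them. The symbols may differ from those used elsewhere in this paper, and the cross-reference weight in the last line is stated as we understand it, before the weights of both images are normalized to the same total.

```latex
\begin{aligned}
\tilde{X} &= \arg\min_{x_{ij}} \sum_{i=1}^{m}\sum_{j=1}^{k} c_{ij}\,x_{ij}
\quad \text{s.t.}\quad x_{ij}\ge 0,\quad \sum_{j=1}^{k} x_{ij}=s_i,\quad \sum_{i=1}^{m} x_{ij}=d_j,\\
c_{ij} &= 1-\frac{u_i^{\top}v_j}{\lVert u_i\rVert\,\lVert v_j\rVert},\qquad
s(U,V)=\sum_{i=1}^{m}\sum_{j=1}^{k}\bigl(1-c_{ij}\bigr)\,\tilde{x}_{ij},\\
s_i &= \max\Bigl\{\,u_i^{\top}\cdot\frac{1}{k}\sum_{j=1}^{k} v_j,\ 0\Bigr\}.
\end{aligned}
```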
The comparative analysis between our proposed method and the state-of-the-art methods described above is presented in Table 4. The proposed method outperforms the other methods by a large margin and achieves new state-of-the-art performance on the three datasets. Although the proposed method only slightly outperforms the prototypical network in the 5-way 5-shot scenario on the UC Merced dataset, it exceeds the prototypical network by more than 4% in the other scenarios. Compared with the other models, our proposed model also shows substantial improvement, e.g., 3.46% in the 1-shot scenario on the UC Merced dataset and 2.31% in the 1-shot scenario on the OPTIMAL-31 dataset. The performance improvement is mainly due to our proposed multiple attention module and attention-reference mechanism. The global attention module and local attention module help the model capture informative features from different perspectives, and the attention-reference mechanism provides more accurate weights for the calculation of the EMD.
In addition, to further analyze the effectiveness of our proposed method, it is compared with the state-of-the-art methods on the UC Merced dataset under different few-shot scenarios, in which the number of classes is set to 3, 4 or 5 and the number of samples per class is set to 1, 5 or 10. As shown in Table 5, the proposed model outperforms all the other state-of-the-art models in every scenario, which once again confirms its effectiveness.
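The N-way K-shot scenarios above correspond to episodes that can be sampled as in the following sketch. The helper name, the number of query samples per class and the toy dataset are assumptions for illustration and do not reflect the exact protocol used for Table 5.

```python
import random


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from `dataset` (label -> list of samples)."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query samples for this class.
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy dataset: 7 hypothetical classes with 20 image identifiers each.
toy_data = {
    f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(7)
}
rng = random.Random(0)
for n_way in (3, 4, 5):
    for k_shot in (1, 5, 10):
        support, query = sample_episode(toy_data, n_way, k_shot, 4, rng)
        print(n_way, k_shot, len(support), len(query))
```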

H. VISUALIZATION OF MULTIPLE ATTENTION
To better understand the effect of the multiple attention module on the proposed model, visualization examples produced with Grad-CAM++ [46] are provided in Fig. 10: (a) original scene images, (b) Grad-CAM++ results without the multiple attention module, and (c) Grad-CAM++ results with the multiple attention module. Lighter areas in the maps indicate regions of greater significance for classification. Six scenes are chosen: an airplane, a harbor, an overpass, a parking lot, a river and a storage tank. Fig. 10 shows that the backbone with the multiple attention module extracts discriminative information for these scenes more effectively. The primary objects in each scene are precisely captured, which contributes substantially to scene classification and reflects both the effect of the multiple attention module and the representation learning capability of the MAEMD-Net.

IV. CONCLUSION
Few-shot learning plays an increasingly vital role when available training samples are scarce. A MAEMD-Net combined with the attention-reference mechanism is presented to solve the few-shot scene classification task in remote sensing. The features of scene objects are extracted by a carefully selected backbone. To improve the representation learning of the model with few samples, the global attention mechanism and the local attention mechanism are integrated into a module in the backbone, which not only guides the neural network to extract discriminative features from multiple perspectives but also improves the efficiency of the model. The attention-reference mechanism, which combines global attention and local attention, makes the EMD more suitable for similarity measurement. The evaluation results on three benchmark remote sensing scene datasets illustrate the effectiveness of our proposed scheme, which surpasses state-of-the-art methods by a remarkable margin. However, our proposed scheme has some limitations. In future work, the multiple attention module will be further improved to strengthen the representation learning ability of the model, and the impact of deeper and wider backbones on model performance will be studied.