Divide to Attend: A Multiple Receptive Field Attention Module for Object Detection in Remote Sensing Images

The study of object detection in remote sensing images has excellent research value in environmental protection and public safety. However, the performance of detectors is unsatisfactory due to the large variability of object size and the complex background noise in remote sensing images. Therefore, it is essential to improve the detection performance of detectors. Inspired by the idea of “divide and conquer”, we propose a plug-and-play attention method, Multiple Receptive Field Attention (MRFA), to solve these problems. First, we use the multiple receptive field feature map generation method to convert the input feature map into four feature maps with different receptive fields. In this way, the small, medium, large, and immense objects in the input feature map are “seen” in these feature maps, respectively. Then, we use the multiple attention map fusion method to focus on objects of different sizes separately, which effectively suppresses noise in the background of remote sensing images. Experiments on the remote sensing object detection datasets DIOR and HRRSD demonstrate that our method outperforms other state-of-the-art attention modules. At the same time, experiments on the remote sensing image semantic segmentation dataset WHDLD and the classification dataset AID prove the generalization and superiority of our method.


I. INTRODUCTION
Object detection in remote sensing images is essential for aerial and satellite image processing. Its main task is to determine whether objects of interest exist in satellite images and to provide their spatial locations. Object detection in remote sensing images is currently used in resource exploration, environmental monitoring, urban planning, and military investigation, and it has significant research value.
In recent years, state-of-the-art object detection algorithms [1], [2], [3], [4], [5], [6] based on deep learning have been widely used in natural scenes. (The associate editor coordinating the review of this manuscript and approving it for publication was Amin Zehtabian.) These detection algorithms are mainly divided into one-stage detectors and two-stage detectors. One-stage detectors include You Only Look Once (YOLO) [4], RetinaNet [6], Single Shot Multibox Detector (SSD) [5], and VarifocalNet (VFNet) [7]. This type of detector classifies and locates the objects in the image in a single pass. As a result, the one-stage detector is faster but less accurate. In contrast, the two-stage detector divides the object detection task into two stages: region proposal and detection. The two-stage detector extracts candidate regions from the input image and then classifies and locates the objects in those candidate regions. Compared with the one-stage detector, the two-stage detector has higher performance but is slower. The representative two-stage detectors are the R-CNN series [1], [2], [3], [8], [9], [10]. Besides, some recent anchor-free detectors have also achieved good performance, for example, Fully Convolutional One-Stage Object Detection (FCOS) [11], Adaptive Training Sample Selection (ATSS) [12], and YOLOX [13].
However, when we directly train and test these detectors on remote sensing images, their performance decreases. The fundamental reason is the data distribution discrepancy between remote sensing images and natural scene images: the detector cannot fit the distribution of the remote sensing dataset well. Compared with natural scene images, remote sensing images have the following characteristics [14]: 1. Large scale variation: there are considerable pixel-area discrepancies between objects in remote sensing images, as the images are generally taken from high altitudes by satellites or aircraft.
2. Complex background noise: There is complex background information in remote sensing images, as shown in Figure 1.
In view of these two characteristics, researchers have proposed solutions. For the problem of object scale variation, the Feature Pyramid Network (FPN) [15] module was proposed, which fuses context information through a feature pyramid to detect objects of different pixel areas. Path Aggregation Network (PANet) [16] adds a bottom-up path to FPN to better integrate bottom-level and high-level features. Besides, researchers have proposed multi-scale strategies to solve this problem. For example, DeepLab [17] combines the final Deep Convolutional Neural Network (DCNN) layer with a fully connected Conditional Random Field (CRF) to achieve accurate object segmentation. DeepLabv2 [18], influenced by Spatial Pyramid Pooling (SPP), proposed Atrous Spatial Pyramid Pooling (ASPP) to handle multi-scale objects. Deep Layer Aggregation (DLA) [19] enhances the standard architecture through deeper aggregation to enable better network performance. DeepLabv3+ [20] applies depthwise separable convolution to the ASPP and decoder modules, effectively improving the performance of the encoder-decoder network. Atrous Spatial Pyramid Convolution with Encoder-Decoder (ASPC-ED) [21] combines Atrous Spatial Pyramid Convolution (ASPC) and an Encoder-Decoder structure (ED) in an end-to-end manner for object detection, which effectively improves detection performance. For the problem that background noise degrades the performance of detectors on remote sensing images, attention mechanisms [22], [23], [24], [25], [26] such as the Squeeze-and-Excitation Network (SE) [22], Coordinate Attention (CA) [23], and the Convolutional Block Attention Module (CBAM) [24] have been proposed. SE [22] realizes adaptive calibration of feature map channel relationships through squeeze and excitation. CBAM [24] introduces spatial information encoding through a large-kernel convolution on top of channel attention.
Although these attention mechanisms have achieved noticeable results in natural scene images, they perform poorly in remote sensing images due to the significant differences between remote sensing images and natural scene images.
In this paper, we propose a new attention mechanism, termed the Multiple Receptive Field Attention mechanism (MRFA), which simultaneously focuses on objects of varying pixel areas in remote sensing images. MRFA searches for objects of various pixel areas in the input feature map through multiple receptive field feature map generation and focuses on objects of different scales using the multiple attention map fusion method. To sum up, the main contributions of our paper are as follows: • We propose a plug-and-play multiple receptive field attention mechanism (MRFA) to focus on objects of different pixel areas in the feature map, which can significantly improve the performance of detectors on remote sensing images.
• MRFA adopts dilated convolution to obtain different receptive fields. In this paper, we propose a novel mapping formula to calculate the dilation rate of each dilated convolution.
• We evaluate the proposed MRFA against five state-of-the-art attention methods on four datasets. Experiments demonstrate that the proposed MRFA achieves better or more competitive performance than previously proposed attention methods.
VOLUME 10, 2022

II. RELATED WORKS
This section will review the representative studies of object detection methods and attention mechanisms used in remote sensing images.

A. RESEARCH ON OBJECT DETECTION IN REMOTE SENSING IMAGES
Many object detection methods for remote sensing images were proposed in the past. Before the Convolutional Neural Network (CNN) [27] was applied to object detection, some object detection methods [28], [29], [30] cut the image into multiple regions and classified these regions to achieve the object detection task; this is essentially a classification task. Among these methods, the Bag of Words (BoW) [28] method represents the image through a set of local regions and completes the detection task by classifying each region. Histograms of Oriented Gradients (HOG) [29] represents the object by the intensity and direction of gradients in local regions.
With the development of deep learning, many excellent methods have been proposed and applied to the object detection task of remote sensing images. Rotation-Invariant CNN (RICNN) [31] detects rotated objects in remote sensing images by adding a rotation-invariant layer to the traditional CNN model and has achieved excellent performance. Rotated Region based CNN (RR-CNN) [32] achieves rotated object detection of ships through a Rotated Region of Interest (RRoI) pooling layer, a Rotated Bounding Box (RBB) area module, and multi-task Non-Maximum Suppression (NMS). The Markov random field-fully convolutional network (M-FCN) [33] mainly improves the detection performance of aircraft through a region proposal generation stage based on multiple Markov random fields. AAF-Faster RCNN [34] applies the Additive Activation Function (AAF) to the Faster Region-based CNN (R-CNN) detector, achieving higher efficiency and more robust performance on remote sensing images. The Split-Merge-Enhancement network (SME-Net) [35] detects objects with significant scale differences in remote sensing images through Offset-Error Rectification (OER), Feature Split-and-Merge (FSM), and Object Saliency Enhancement (OSE), and has achieved good results on several remote sensing image object detection datasets. The Decoupled Classification Localization Network (DCL-Net) [36] decouples the classification and regression tasks of the detector through the Receptive Field Aggregation Module (RFAM) and the Path Aggregation Module (PAM), which significantly improves the detection performance on remote sensing images.

B. RESEARCH ON ATTENTION MECHANISM
In recent years, attention mechanisms have been widely used across various fields of deep learning and multiple image tasks, such as image classification, object detection, and semantic segmentation. Among them, the Squeeze-and-Excitation Network (SE) [22] realizes adaptive calibration of feature map channel relationships through squeeze and excitation. Efficient Channel Attention (ECA) [37] proposes a local cross-channel interaction strategy without dimensionality reduction and an adaptive method to select the size of the one-dimensional convolution kernel, thus achieving performance optimization. Dilated Efficient Channel Attention (DECA) [38] proposes a new multi-scale channel interaction method and a channel-correlation adaptive calibration strategy based on ECA. Both attention methods improve the network's performance by calibrating the channel correlation of the input feature map.
In addition, Coordinate Attention (CA) [23] is an excellent attention mechanism that embeds the position information of the input feature map into channel attention through coordinate information embedding and coordinate attention generation. The Convolutional Block Attention Module (CBAM) [24] integrates channel attention with spatial attention and obtains the spatial information encoding of the input feature map through a large-kernel convolution on top of channel attention. After that, attention mechanisms such as Triplet Attention (TA) [39] and the Gather-Excite Network (GENet) [40] adopted different spatial attention modules to integrate channel and spatial attention based on the idea of CBAM. The Multi-style Attention Fusion Network (MAFNet) [41] refines low-level and mid-level semantic features through the Dual-cues Spatial Attention module (DSA) and the Dual Attention Intermediate Representation module (DAIR), obtains high-level semantic features through the High-level Channel Attention module (HCA), and finally fuses all the features through the Multi-Level Feature Fusion module (MLFF), which effectively improves the network's attention to small objects. The Center-Boundary Dual Attention Network (CBDA-Net) [42] proposes a center-boundary dual attention (CBDA) module, which can effectively reduce the background noise in remote sensing images by attending to the center and boundary features of the feature map with dual attention.
At present, many scholars have carried out extensive research on the self-attention mechanism, such as Non-Local [26], the Global Context Network (GC) [25], Self-Calibrated convolution (SC) [43], and Criss-Cross attention (CC) [44], all of which use non-local mechanisms to obtain different types of spatial information. However, their computational cost is extensive, which significantly impacts the detector's speed, and their performance in the remote sensing field is not ideal.

III. METHODOLOGY
In this section, we first review the SE [22] and CA [23] modules and then describe the details of our MRFA module.

A. REVIEW OF SE AND CA
1) SE MODULE
The SE [22] module contains two components: squeeze and excitation. The squeeze can embed the feature map's global information into the feature map's channel, and the excitation can adaptively calibrate the relationship between feature map channels. The structure diagram is shown in Figure 2.
First, any input X with dimension C×H×W is squeezed by global average pooling to obtain the global average feature with dimension C×1×1. Then the excitation operation is carried out to obtain the dependence between channels through two linear transformations. Finally, the generated channel relevance is used to recalibrate X adaptively. The SE module has been widely proven to be effective; still, it only attends to the channel dimension of the feature map and ignores the importance of location information in the feature map.
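For clarity, the squeeze-and-excitation computation described above can be sketched in NumPy as follows. The weights, shapes, and reduction ratio (r = 2) are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def se_block(x, w1, w2):
    """Minimal SE forward pass. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    z = x.mean(axis=(1, 2))               # squeeze: global average pooling -> (C,)
    s = np.maximum(w1 @ z, 0.0)           # excitation, step 1: linear transform + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # excitation, step 2: linear transform + sigmoid
    return x * s[:, None, None]           # recalibrate each channel of x by its weight

# toy usage with C=8, H=W=4 and a reduction ratio of 2
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = 0.1 * rng.standard_normal((4, 8))
w2 = 0.1 * rng.standard_normal((8, 4))
y = se_block(x, w1, w2)
```

Note that every spatial position in a channel is scaled by the same scalar, which is exactly why SE captures channel importance but no location information.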

2) CA MODULE
The CA [23] module mainly consists of two components: coordinate information embedding and coordinate attention generation. The structure diagram is shown in Figure 3.
For any input X with dimension C×H×W, the coordinate information is embedded by encoding each channel along the horizontal and vertical directions using average pooling over the spatial ranges (H, 1) and (1, W). Then, the horizontal-direction feature map with dimension C×H×1 and the vertical-direction feature map with dimension C×1×W are concatenated. The intermediate feature map is obtained by channel transformation through convolution. Next, the intermediate feature map is re-divided into two independent feature maps. The split feature maps are transformed by two convolutions so that the two transformed feature maps have the same number of channels as the input X. Finally, the two feature maps are multiplied with X, applying the attention results in the horizontal and vertical directions to the input X. In general, CA encodes the generated feature maps to form a pair of direction-aware and position-sensitive feature maps. The feature maps can be complementarily applied to the input X to enhance the representation of the objects of interest.
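The directional pooling and re-splitting above can be sketched as follows. This is a simplified NumPy view with random placeholder weights: the original CA module concatenates the pooled maps along the transformed channel dimension and uses BatchNorm with a hard-swish nonlinearity, which are omitted here for brevity.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def ca_block(x, w_mid, w_h, w_w):
    """Minimal Coordinate Attention forward pass.
    x: (C, H, W); w_mid: (C//r, C); w_h, w_w: (C, C//r)."""
    C, H, W = x.shape
    pool_h = x.mean(axis=2)                        # (C, H): average pooling over width
    pool_w = x.mean(axis=1)                        # (C, W): average pooling over height
    y = np.concatenate([pool_h, pool_w], axis=1)   # splice the two pooled maps -> (C, H+W)
    y = np.maximum(w_mid @ y, 0.0)                 # shared 1x1 channel transform + nonlinearity
    y_h, y_w = y[:, :H], y[:, H:]                  # re-split into the two directions
    a_h = sigmoid(w_h @ y_h)                       # (C, H): attention along the vertical axis
    a_w = sigmoid(w_w @ y_w)                       # (C, W): attention along the horizontal axis
    return x * a_h[:, :, None] * a_w[:, None, :]   # apply both direction-aware maps to x

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 5, 6))
w_mid = 0.1 * rng.standard_normal((4, 8))
w_h = 0.1 * rng.standard_normal((8, 4))
w_w = 0.1 * rng.standard_normal((8, 4))
y = ca_block(x, w_mid, w_h, w_w)
```

Unlike SE, each spatial position (h, w) receives its own scaling factor a_h[c, h] · a_w[c, w], which is how CA encodes position information.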
The CA module and SE module can reduce the background noise of remote sensing images and guide the network to focus on the foreground objects of the feature map. On the other hand, they ignore the diversity of object pixel areas in the original input X. As a result, when these attention blocks are directly applied to remote sensing images, the improvement to the network is slight. We verify this conclusion through extensive experiments in Section IV. In the next part, we propose a plug-and-play multiple receptive field attention mechanism that can effectively solve the above problems.

B. MULTIPLE RECEPTIVE FIELD ATTENTION MECHANISM
The first reason for the poor performance of detectors in the remote sensing field is that there is no receptive field of suitable size to 'see' objects of different pixel areas in the feature map. The second reason is that the complex background noise in remote sensing images causes significant interference to the detector. To address these problems, we propose a multiple receptive field attention mechanism that focuses on objects with different receptive fields, which can simultaneously solve the problems of object pixel-area diversity and background noise in remote sensing images. Our multiple receptive field attention module contains two components: multiple receptive field feature map generation and multiple attention map fusion, which we introduce in detail next.

1) MULTIPLE RECEPTIVE FIELD FEATURE MAP GENERATION (MRF)
The first consideration of our attention module is to make the module as lightweight as possible. Inspired by group convolution [45], we first slice the original input feature map along the channel dimension and then process each slice separately, reducing a large number of parameters. We cut the input feature map into four parts to focus on the small, medium, large, and immense objects of remote sensing images, respectively. Then, we convolve each feature map separately to obtain feature maps with different receptive fields. A convolution with a larger kernel can obtain feature maps with a larger receptive field, but it brings many parameters and considerable computation, which slows down the network. Therefore, we use dilated convolution [46] to obtain the feature maps of different receptive fields.
Specifically, for an arbitrary input feature map X with dimension C×H×W, we cut X into four new feature maps in the channel dimension through the split operation. The output consists of four subparts and can be expressed as the following equation 1.
X_i = Split_c(X), i ∈ [1, 2, 3, 4] (1)

where a = C/4 indicates that we divide the feature map into four new feature maps of a channels each, Split_c denotes the split operation on X along the channel dimension, and X_i represents the i-th split feature map. We will analyze and discuss the setting of this hyperparameter in detail later. Next, dilated convolutions are performed on the split feature maps to generate feature maps of different receptive fields, which can be expressed as equation 2.
f_i = L(F_i(X_i)), i ∈ [1, 2, 3, 4] (2)

where F_i represents the i-th dilated convolution, L represents the LeakyReLU activation function, and f_i represents the i-th output feature map. dilation_i represents the dilation rate of the i-th dilated convolution. The dilation rate determines the receptive field of the feature map, which has a notable impact on the overall performance of our module. Therefore, we propose a novel mapping function to determine the dilation rate of each convolution, which can be expressed as equation 3.
where h and w represent the height and width of the input feature map, ⌊·⌋ represents the rounding-down operation, and dilation_i represents the i-th dilation rate. When dilation_i is 1, the original receptive field of the feature map is preserved, focusing on small objects in the feature map. The receptive fields of the four generated feature maps cover the small, medium, large, and immense objects in the input feature map X, solving the problem of object pixel-area diversity in remote sensing images. The structure diagram of MRF is shown in Figure 4.
Additional information: to obtain feature maps of different receptive fields while reducing the number of parameters, the kernel size of each dilated convolution is 3 and the stride is 1. To enable the fusion of the multiple receptive field feature maps, we pad the feature maps so that their dimensions remain unchanged after dilated convolution.
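The split-then-dilate procedure of equations 1 and 2 can be sketched as below. This is a simplified NumPy illustration: a single shared 3×3 kernel per group (depthwise-style) stands in for the module's full convolutions, and the dilation rates [1, 2, 4, 8] are hypothetical placeholders, since the actual rates come from the mapping function in equation 3.

```python
import numpy as np

def dilated_conv3x3(x, w, dilation):
    """'Same'-padded 3x3 dilated convolution on one channel (stride 1).
    The effective kernel size is 2*dilation + 1, so the receptive field
    grows with the dilation rate at no extra parameter cost."""
    pad = dilation
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * xp[i * dilation:i * dilation + H,
                                j * dilation:j * dilation + W]
    return out

def mrf(x, kernels, dilations, slope=0.1):
    """Sketch of equations 1-2: split x (C, H, W) into four groups along the
    channel axis, then apply a dilated convolution and LeakyReLU per group."""
    outs = []
    for x_i, w_i, d_i in zip(np.split(x, 4, axis=0), kernels, dilations):
        f_i = np.stack([dilated_conv3x3(c, w_i, d_i) for c in x_i])
        outs.append(np.where(f_i > 0, f_i, slope * f_i))   # LeakyReLU
    return outs

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 6))
kernels = [0.1 * rng.standard_normal((3, 3)) for _ in range(4)]
outs = mrf(x, kernels, [1, 2, 4, 8])   # hypothetical dilation rates
```

Because padding equals the dilation rate, each output slice keeps the input's spatial size, matching the "additional information" above.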
Hyperparameter discussion: We slice the input feature map into four new feature maps for two reasons. The first reason is that we analyzed the labels of remote sensing images and found that the objects in remote sensing images can be mainly classified into small, medium, large, and immense. The second reason is that in almost all network structures, the number of channels is divisible by 4. We conduct the following experiments to verify that cutting the input feature map into four parts is optimal, where s represents the equal division of the input feature map into s parts along the channel dimension. Experimental results are shown in Table 1. The number of channels in the backbone network is divisible by 2, so we set s to 2, 4, 8, and 16 for comparison. We can conclude from the results in Table 1 that the model achieves the best result when we split the input feature map into four new feature maps along the channel dimension.

2) MULTIPLE ATTENTION MAP FUSION
Equations 1, 2, and 3 cut the input feature map and generate four new feature maps with different receptive fields. These feature maps are responsible for the small, medium, large, and immense objects in the input feature map, respectively. We propose three multiple attention map fusion structures, MRFA-T0, MRFA-T1, and MRFA-T2, to reduce the negative impact of background noise on the network while solving the problem of the large variety of object pixel areas in remote sensing images. We use the SE [22] and CA [23] modules as our basic attention blocks. The details are as follows: (1) Multiple receptive field attention T0 (MRFA-T0). We use multiple receptive field feature map generation to turn the input feature map into four feature maps with the same dimension. However, since the object scales of interest in these four feature maps are different, they have different channel importance. Therefore, we calculate the importance and correlation of the channels in each feature map separately, then concatenate these four feature maps along the channel dimension to obtain the multiple receptive field channel attention map. Finally, it is applied to the input feature map to achieve adaptive calibration of the channel relationships. This process can be expressed as equation 4.
Out = σ(Concat[S_1(f_1), S_2(f_2), S_3(f_3), S_4(f_4)]) (4)

where f is the output result of MRF and represents four feature maps with different receptive fields; the specific details are given in equation 2. S represents the SE attention module, and the subscripts 1, 2, 3, 4 indicate the index numbers of the four feature maps. Concat represents the splicing of the four feature maps along the channel dimension. σ represents the Sigmoid activation function. Out represents the multiple receptive field attention fusion feature map. Finally, we adaptively calibrate the relationship between the channels of the input feature map, which can be expressed as equation 5.
X_out = X ⊗ Out (5)

where X represents the input feature map, ⊗ denotes element-wise multiplication, and X_out indicates the output result of MRFA-T0. The structure diagram of MRFA-T0 is shown in Figure 5.
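The MRFA-T0 fusion of equations 4 and 5 can be sketched as follows. This NumPy sketch reuses a minimal SE block; the splits stand in for the MRF outputs f_i, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def se_block(f, w1, w2):
    """SE on one split feature map f: (c, H, W); w1: (c//r, c), w2: (c, c//r)."""
    z = f.mean(axis=(1, 2))
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    return f * s[:, None, None]

def mrfa_t0(x, splits, se_params):
    """Sketch of equations 4-5: SE on each receptive-field split, concatenation
    along the channel axis, a sigmoid gate, then element-wise application to X."""
    calibrated = [se_block(f, w1, w2) for f, (w1, w2) in zip(splits, se_params)]
    out = sigmoid(np.concatenate(calibrated, axis=0))   # (C, H, W) attention map
    return x * out                                      # X_out = X (*) Out

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 4, 4))
splits = list(np.split(x, 4, axis=0))   # stand-in for the MRF outputs f_1..f_4
se_params = [(0.1 * rng.standard_normal((1, 2)),
              0.1 * rng.standard_normal((2, 1))) for _ in range(4)]
y = mrfa_t0(x, splits, se_params)
```

MRFA-T1 and MRFA-T2 follow the same pattern, with CA blocks (or SE followed by CA) substituted for the per-split SE blocks.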
(2) Multiple receptive field attention T1 (MRFA-T1). The design idea of this structure is to make the feature maps with different receptive fields pay attention to the position information of objects of various pixel areas while adaptively calibrating channel importance. In this case, the four feature maps with different receptive fields are input into the CA module to obtain the attention information of the four feature maps. Finally, we can pay full attention to the objects with significant pixel-area differences in the input feature map through multiple feature map fusion, which can be expressed as equation 6.
Out = σ(Concat[C_1(f_1), C_2(f_2), C_3(f_3), C_4(f_4)]) (6)

where f is the output result of MRF and represents four feature maps with different receptive fields; the specific details are given in equation 2. C represents the CA attention module, and the subscripts 1, 2, 3, 4 indicate the index numbers of the four feature maps. Concat represents the splicing of the four feature maps along the channel dimension. σ represents the Sigmoid activation function. Out represents the multiple receptive field attention fusion feature map. Finally, we attend to the channel relationships and location information of the input feature map, which can be expressed as equation 7.
X_out = X ⊗ Out (7)

where X represents the input feature map, ⊗ denotes element-wise multiplication, and X_out indicates the output result of MRFA-T1. The structure diagram of MRFA-T1 is shown in Figure 6.
(3) Multiple receptive field attention T2 (MRFA-T2). In this structure, our design idea is to fully integrate the multiple receptive field feature map generation method with the attention mechanism. First, the four feature maps with different receptive fields are input into the channel attention module to recalibrate channel relevance, respectively. Then, the spatial information of the calibrated feature maps is located through the CA module, and the results are concatenated along the channel dimension, which can be expressed as equation 8.
Out = σ(Concat[C_1(S_1(f_1)), C_2(S_2(f_2)), C_3(S_3(f_3)), C_4(S_4(f_4))]) (8)

where f is the output result of MRF and represents four feature maps with different receptive fields; the specific details are given in equation 2. S represents the SE attention module, and C represents the CA attention module. The subscripts 1, 2, 3, 4 indicate the index numbers of the four feature maps. Concat represents the splicing of the four feature maps along the channel dimension. σ represents the Sigmoid activation function. Out represents the multiple receptive field attention fusion feature map. Finally, the generated multiple receptive field attention map is added to the original input feature map to pay full attention to the small, medium, large, and immense objects in the input feature map, which can be expressed as equation 9.
X_out = X + Conv(Out) (9)

where X represents the input feature map, Conv denotes a convolutional block with kernel size and stride of 1 that adaptively adjusts the multiple receptive field fusion feature information, and X_out indicates the output result of MRFA-T2. The structure diagram of MRFA-T2 is shown in Figure 7.

C. INTEGRATION STRATEGY
Our MRFA is a plug-and-play module and can easily be added to any backbone to improve performance. In the following experiments, the MRFA block is added to the last residual block of the last layer of the backbone network. Other attention modules are added at the same position in the comparative experiments. The structure diagram of the integration strategy is shown in Figure 8.
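The insertion strategy can be shown schematically as below, with plain callables standing in for backbone stages and the attention block. All names are hypothetical; the sketch only illustrates the plug-and-play idea of wrapping the final stage's output.

```python
def attach_attention(stages, attention):
    """Return a copy of `stages` with an attention callable applied to the
    output of the final stage (the insertion point used in this paper)."""
    stages = list(stages)
    last = stages[-1]
    stages[-1] = lambda v, _last=last: attention(_last(v))
    return stages

def forward(stages, v):
    """Run an input through the stages in order."""
    for stage in stages:
        v = stage(v)
    return v

# toy usage with scalars standing in for feature maps
backbone = [lambda v: v + 1, lambda v: v * 2]          # two stand-in stages
model = attach_attention(backbone, lambda v: v * 0.5)  # stand-in attention block
```

Because the wrapper only composes callables, the original backbone list is left untouched, mirroring how a plug-and-play module adds no constraints on the host network.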

IV. EXPERIMENTS
In this section, we first perform ablation experiments on our proposed MRFA to demonstrate the necessity of each component of our module. Then, we compare MRFA with other attention methods in object detection tasks. Finally, we thoroughly compare our MRFA with other attention methods in image classification and semantic segmentation to verify the generalization and versatility of our approach.

A. EXPERIMENTAL SETUP IN OBJECT DETECTION 1) DATASET DESCRIPTION
We conduct object detection experiments on the remote sensing image object detection datasets DIOR [48] and TGRS-HRRSD-Dataset (HRRSD) [49]. DIOR is a large object detection dataset of remote sensing images, containing 23,463 images and 190,288 instances covering 20 object classes. The object pixel area in this dataset varies greatly, with significant differences in imaging conditions, weather, season, and quality, making it a challenging benchmark for object detection in remote sensing images. HRRSD is a high-resolution object detection dataset of remote sensing images produced by the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, which contains 21,761 images and 55,740 object instances covering 13 object classes.

2) EVALUATION PROTOCOL
We evaluate our experiments using the evaluation protocol proposed by MS COCO [50], reporting AP_50:95, AP_S, AP_M, and AP_L.
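As a reminder of what the headline metric measures, the COCO-style AP_50:95 is simply the AP averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The per-threshold AP values below are hypothetical, for illustration only.

```python
import numpy as np

# the ten IoU thresholds 0.50, 0.55, ..., 0.95 used by the COCO protocol
iou_thresholds = np.round(np.arange(0.50, 1.00, 0.05), 2)

# hypothetical per-threshold AP values (AP usually drops as IoU tightens)
ap_at_iou = np.linspace(0.60, 0.15, iou_thresholds.size)
ap_50_95 = ap_at_iou.mean()   # the single AP_50:95 number reported in the tables
```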

3) EXPERIMENTAL DETAILS
The batch size in all our experiments is 8. The number of epochs is 150 for YOLOV3 [47] and 300 for YOLOX [13]; other configurations are the default settings. All our experiments are carried out on an NVIDIA Tesla V100 16GB.

B. ABLATION EXPERIMENTS
We performed ablation experiments on MRFA to observe its behavior in remote sensing image object detection, using YOLOV3 [47] as the baseline. Since MRFA is divided into multiple receptive field feature map generation and multiple attention map fusion, we gradually added the various components of MRFA in the experiment to observe their impact on the performance of remote sensing image object detection. The experimental results are shown in Table 2. After adding the multiple receptive field feature map generation module, the detector improves by 1.1 AP_50:95. After further adding spatial attention, the detector improves by another 0.9 AP_50:95. When all modules are added, the detector improves by 2.7 AP_50:95.

C. COMPARED WITH STATE-OF-THE-ART ATTENTION MECHANISM IN THE OBJECT DETECTION TASK
To evaluate the superiority of MRFA, we compared it with some state-of-the-art methods. The experiment was conducted on the DIOR dataset, with YOLOV3 as the benchmark. The results are shown in Table 3. The experimental results show that the YOLOV3 detector equipped with MRFA-T2 achieves the best results in all metrics. Among them, MRFA-T2 boosts the AP_S of the baseline by 1.5, which is the smallest improvement compared to AP_M and AP_L. We believe there are two main reasons for this phenomenon. First, the feature map is scaled down by a factor of 32 before being input into the MRFA structure; thus, the receptive field of each grid cell of the feature map has been enlarged, so MRFA is not sensitive enough to the small objects in the feature map. Second, the complex environment of remote sensing images, coupled with the tiny objects they contain, results in a minor enhancement of small objects by MRFA. In terms of large objects, MRFA-T2 improves the AP_L of the baseline by 3.1, a substantial gain. We believe this mainly benefits from the two branches of the MRF structure that focus on large and immense objects. The experimental results demonstrate the effectiveness of MRFA well. The detection results of the YOLOV3 detector equipped with the MRFA module on the DIOR validation set are shown in Figure 9.
To verify the generalization of the MRFA module in the object detection task, we repeated the above experiment on the anchor-free object detector YOLOX [13]. The experimental results are shown in Table 4. They show that MRFA-T1 achieves the best performance in YOLOX on all evaluated metrics. Among them, MRFA-T1 improves the AP_L of the baseline by 2.4, whereas CA only improves the AP_L of the baseline by 0.8, which proves the effectiveness of our method in focusing on large and immense objects in remote sensing images separately. In terms of AP_M, the enhancement of MRFA-T1 is minor. However, we found that methods such as SE and GC reduced the AP_M of the baseline; we believe the main reason for this phenomenon is the variability of the performance of attention modules on different detectors. In the YOLOX detector, MRFA-T1 achieved the best results, which differ from the results in Table 3, and we discuss and analyze them in detail later. The detection results of the YOLOX detector equipped with the MRFA module on the DIOR validation set are shown in Figure 10.
To verify the universality of the MRFA module in the remote sensing image object detection task, we re-conducted the above experiments on the HRRSD [49] dataset. The specific results are shown in Tables 5 and 6. In the HRRSD experiments on both YOLOV3 and YOLOX, we found an unusual phenomenon: the improvement of AP_S by MRFA is exaggerated. After extensive investigation, we found that the main reason is that the dataset contains few or no small objects, resulting in considerable fluctuations in AP_S. The detailed statistical results are shown in Table 7.
Discussion of experimental results: From the experimental results, we can see that the performance of an attention module differs across detectors and datasets. Therefore, the proposed MRFA has three structures, MRFA-T0, MRFA-T1, and MRFA-T2, which can effectively adapt to various conditions to improve performance. In the YOLOV3 detector (Tables 3 and 5), MRFA-T2 has the best performance. In the YOLOX detector (Tables 4 and 6), MRFA-T1 has the most significant improvement over the baseline. The experimental results are consistent with our expectation that MRFA (MRFA-T0, MRFA-T1, MRFA-T2) can achieve the best performance under different conditions.

D. REMOTE SENSING IMAGE CLASSIFICATION
To verify the versatility of the MRFA method, we transfer it to the image classification task for extensive experiments.

1) DATASETS AND EVALUATION PROTOCOL
In this part of the experiments, we use a public remote sensing image classification dataset called AID [51]. It contains 10,000 remote sensing images with a pixel size of about 600×600, covering 30 scene categories with about 220-420 images in each category. Evaluation protocol: We evaluate our experiments using precision, recall, Top-1 Acc, Top-5 Acc, and F1-score. Top-1 Acc indicates the probability that the most probable category predicted by the network is correct, and Top-5 Acc indicates the probability that one of the top five predicted categories is correct. F1-score is a statistical measure of classification accuracy that considers both the precision and recall of the model.
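For clarity, the Top-k accuracy and F1-score used above can be sketched in a few lines of Python. This is an illustrative implementation, not the evaluation code used in our experiments; in particular, the macro averaging over classes is an assumption:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

def macro_f1(preds, labels, num_classes):
    """Macro-averaged F1: per-class F1 from precision and recall, then averaged."""
    f1s = []
    for c in range(num_classes):
        tp = sum(p == c and t == c for p, t in zip(preds, labels))
        fp = sum(p == c and t != c for p, t in zip(preds, labels))
        fn = sum(p != c and t == c for p, t in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / num_classes
```

Top-1 Acc is the k = 1 case, and Top-5 Acc the k = 5 case, of `top_k_accuracy`.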

2) EXPERIMENTAL DETAILS
We use ResNet [52] as the baseline for the remote sensing image classification task. We set the batch size to 16 and the initial learning rate to 0.00125, use the cosine learning schedule for learning rate decay, and train each experiment for 100 epochs.
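The cosine learning schedule mentioned above can be sketched as follows. The base learning rate matches the 0.00125 used in our experiments; the minimum learning rate of zero and the absence of a warm-up phase are simplifying assumptions:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.00125, min_lr=0.0):
    """Cosine-annealed learning rate: decays smoothly from base_lr to min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the start of training the schedule returns `base_lr`, at the midpoint half of it, and at the final step `min_lr`.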

3) EXPERIMENTAL RESULTS
We compare our method with other attention methods on ResNet50 and ResNet101, as detailed in Table 8. The results show that our MRFA maintains the best performance at different network depths. On ResNet50, our method improves the Top-1 Acc by 2.4669 and the F1-score by 2.5469. On ResNet101, MRFA improves the Top-1 Acc by 2.3767 and the F1-score by 2.6088, while the performance of the other attention methods is unsatisfactory. For example, self-attention mechanisms such as GC and Non-Local severely degrade accuracy on the remote sensing image classification task.

E. REMOTE SENSING IMAGE SEMANTIC SEGMENTATION
In this part, we verify the versatility of MRFA in the image semantic segmentation task.

1) DATASETS AND EVALUATION PROTOCOL
We use the remote sensing image semantic segmentation dataset Wuhan dense labeling dataset (WHDLD) [53]. WHDLD contains 4940 RGB images with a pixel size of 256×256, covering six categories with 22,821 labels. In the semantic segmentation task, we use mIoU as the evaluation metric.
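The mIoU metric can be sketched as below on flattened per-pixel class maps. This is an illustrative implementation rather than the evaluation code used in our experiments; skipping classes absent from both prediction and ground truth is an assumption:

```python
def mean_iou(preds, labels, num_classes):
    """Mean intersection-over-union over classes, on flattened per-pixel labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(preds, labels))
        union = sum(p == c or t == c for p, t in zip(preds, labels))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)
```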

2) EXPERIMENTAL DETAILS
We use EncNet [54] as the baseline for semantic segmentation of remote sensing images. The batch size is set to 8, the initial learning rate to 0.00125, the cosine learning schedule is used for learning rate decay, and each experiment is trained for 40,000 iterations.

3) EXPERIMENTAL RESULTS
We compare our method with other attention methods on EncNet, using ResNets of different depths as the encoder. The detailed results are shown in Table 9. The experimental results show that our method still outperforms other state-of-the-art attention mechanisms on the remote sensing segmentation task.

F. ADDITIONAL EXPERIMENTS
In this section, we perform additional experiments to verify the superiority of MRFA.
We compare MRFA with the multiscale strategy ASPP [20], which is similar to our approach in that it samples the feature map in parallel with multiple dilated convolutions to obtain contextual information; it is mainly applied to semantic segmentation tasks. However, there are still substantial differences between MRFA and ASPP. First, MRFA's dilation rates are calculated by Equation 3 and are therefore adaptive, while ASPP's dilation rates are fixed. Second, ASPP is dedicated to capturing contextual information with dilated convolutions of different rates, while MRFA is mainly used to attend to objects with significant scale differences in the feature map; their purposes differ. Finally, their performance differs substantially. The experimental results are shown in Table 10: ASPP only slightly improves the performance of the YOLOV3 detector, which we attribute to its being better suited to semantic segmentation tasks. In addition, ASPP introduces an enormous number of additional parameters. Compared with ASPP, MRFA has clear advantages in object detection tasks. Next, we compare the YOLOV3 detector equipped with MRFA with other state-of-the-art detectors on the DIOR dataset. The experimental results are shown in Table 11, where we can see that YOLOV3 equipped with MRFA achieves the best performance on the DIOR dataset.
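The parallel multi-rate sampling shared by ASPP and MRFA can be illustrated in 1-D. The kernel and dilation rates below are arbitrary toy values chosen for illustration, not the adaptive rates produced by Equation 3:

```python
def dilated_conv1d(x, w, d):
    """Valid 1-D convolution of signal x with kernel w at dilation rate d."""
    k = len(w)
    span = (k - 1) * d  # extra input positions covered by dilation
    return [sum(w[j] * x[i + j * d] for j in range(k))
            for i in range(len(x) - span)]

def parallel_branches(x, w, rates):
    """Apply the same kernel at several dilation rates, as ASPP/MRFA branches do."""
    return [dilated_conv1d(x, w, d) for d in rates]
```

With the same kernel, a larger rate samples a wider window of the input, which is how each branch "sees" objects of a different size.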
Finally, we verify the effectiveness of our method again on the FCOS detector; the experimental results are presented in Table 12. They show that MRFA is a plug-and-play attention module with strong generalization and versatility, and that it performs well on remote sensing images.

V. DISCUSSION
Based on the above experiments, we can find that the performance of these SOTA attention methods is unsatisfactory, and we believe this is caused by the large differences in object pixel area in remote sensing images. To test this hypothesis, we visualized heat maps of the feature maps generated by different attention methods, as shown in Figure 11. These attention mechanisms can indeed reduce the background noise in remote sensing images, but they cannot effectively attend to the large and small objects in the feature map simultaneously. For example, the CA [23] and CBAM [24] attention mechanisms focus well on large objects but lack attention to small objects, while the Non-Local [26] attention method focuses more on small objects and on fine features within large objects. In contrast, MRFA can effectively suppress background noise in remote sensing images while simultaneously focusing on objects with significant differences in pixel area. However, MRFA still has some shortcomings. For example, for larger ships it seems to focus more on the two ends of the ship and less on the middle part. We think this is because the bow and stern features are more distinctive, so the network equipped with MRFA relies mainly on them to classify and locate ships.
In summary, MRFA captures objects of different pixel areas in the input feature map with dilated convolutions of different scales and attends to these objects separately, which is a novel and effective approach. However, the number of extra parameters introduced by MRFA is high.
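The receptive-field effect of dilation that MRFA exploits follows the standard effective-kernel-size formula, sketched below. The (kernel, rate) pairs in the usage example are hypothetical and are not the rates computed by Equation 3:

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def stacked_rf(layers):
    """Receptive field of a stack of stride-1 conv layers given as (k, d) pairs."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf
```

For example, a 3×3 kernel at rates 1, 2, and 4 covers 3, 5, and 9 input positions per axis, so parallel branches with increasing rates see increasingly large regions of the feature map at no extra parameter cost per branch.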

VI. CONCLUSION
To address the problems of large differences in object pixel area and background noise in remote sensing images, this paper proposes a novel plug-and-play attention mechanism, MRFA. It can effectively handle the scale variations and background noise in remote sensing images. Extensive experiments on various remote sensing visual tasks illustrate that our method outperforms other state-of-the-art attention mechanisms and generalizes well. However, our method currently introduces a relatively large number of parameters. In future work, we will focus on reducing the number of parameters of the module while improving its performance.