Optical Remote Sensing Image Target Detection Based on Improved Feature Pyramid

At present, many deep-convolution-based remote sensing image target detection methods have been developed and have achieved higher detection accuracy and faster detection speed. However, they do not perform well on datasets with large target scale changes and multiclass, dense small targets. Therefore, solving the problem of scale change in remote sensing images is the focus of our research. An improved feature pyramid model named feature enhancement feature pyramid network (FE-FPN) is presented in this article. The FE-FPN utilizes a channel enhancement module (CEM), unpooling feature fusion (UPFF), and an adaptive pooling spatial attention module (APSAM) to reduce information loss during the generation of feature maps and improve the representation capability of the feature pyramid. The CEM is designed to expand the receptive field and learn important features adaptively, the UPFF is designed to improve the feature fusion method to avoid feature conflicts, and the APSAM is used to complement high-level feature information. With ResNet50 as the backbone, the average precision is 2.0% higher when the feature pyramid network in Cascade R-CNN is replaced by the FE-FPN. The proposed FE-FPN model is quantitatively compared with several classical feature pyramid models, and the results show that the FE-FPN outperforms the other models.


I. INTRODUCTION
Remote sensing technology (RST) is not restricted by ground conditions and can acquire data over a wide range at high speed and with a short cycle time. At present, RST has penetrated into all walks of life in the national economy, providing new ideas for solving various environmental problems [1], [2], providing an important database for urban life planning and water conservancy construction [3], [4], and playing a major role in promoting economic construction, social construction, and national defense construction [5], [6], [7]. Remote sensing images (RSIs) have become increasingly rich as the technology has developed, but the information they contain has not yet been explored and analyzed in its entirety, and future research will focus on the processing of remote sensing information [8], [9]. In practical applications, the detection task is aimed not only at a single target but also at accurately detecting multiple possible targets in an image. In the era of convolutional neural networks, many excellent detectors have been developed, such as feature pyramid network (FPN) [10], R-CNN series [11], [12], [13], [14], YOLO series [15], [16], [17], [18], [19], and so on [20], [21], [22], [23]. However, these detectors do not perform well on RSIs, mainly because RSIs are unique compared with natural images [24], as illustrated in Fig. 1. First, RSIs are taken from a variety of angles and directions at high altitudes, and the background is extremely complex. Second, there are rich types of targets with different scales and sizes, and small targets take up fewer pixels in the whole image and are easily confused with the background, especially near large targets. Finally, there is occlusion or congestion of dense targets, making detection even more difficult. To achieve target detection on RSIs, the problems of multiscale targets and dense small targets should be solved first.
Lin et al. [10] proposed that inserting some network layers between the backbone network and the head network to let different feature maps interact can help detect targets of multiple scales, and the proposed FPN, also known as the neck network of the detector structure, has now become a common structure for detectors. As shown in Fig. 2, the FPN structure is divided into three stages: 1) bottom-up, a process in which features are condensed layer by layer to express features; 2) lateral connection, which uses 1 × 1 convolution at the output layers of the backbone network to ensure that the fused feature maps have the same number of channels; and 3) top-down, where adjacent feature layers are fused after upsampling. This network structure has obvious advantages in detecting targets whose scale varies over a large range.
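The lateral connection and top-down stages described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names, the toy channel counts, and the random weights are hypothetical, the 1 × 1 convolution is written as a per-pixel channel mix, and the 3 × 3 smoothing convolution that the original FPN applies to each merged map is omitted for brevity.

```python
import numpy as np

def lateral_1x1(c, w):
    """1x1 convolution as a per-pixel channel mix: (C_in,H,W) -> (C_out,H,W)."""
    return np.einsum("oc,chw->ohw", w, c)

def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling by a factor of 2 along H and W."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features, weights):
    """features: backbone maps ordered fine to coarse; returns merged maps in that order."""
    laterals = [lateral_1x1(c, w) for c, w in zip(features, weights)]
    merged = [laterals[-1]]                      # the coarsest level starts the top-down path
    for lat in reversed(laterals[:-1]):
        merged.append(lat + upsample_nearest_2x(merged[-1]))  # element-wise summation
    return merged[::-1]

# Hypothetical toy sizes: three levels, each half the resolution of the previous one.
rng = np.random.default_rng(0)
chans, sizes, out_ch = [8, 16, 32], [16, 8, 4], 4
feats = [rng.standard_normal((c, s, s)) for c, s in zip(chans, sizes)]
ws = [rng.standard_normal((out_ch, c)) for c in chans]
outs = fpn_top_down(feats, ws)
```

Every output map has the same channel count, which is what allows the element-wise summation between adjacent levels.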
However, there are some limitations in FPNs: 1) the reduction of channel dimension loses part of the spatial information of each feature layer; 2) the simple upsampling summation of different feature layers may reduce the feature expression ability; and 3) the top-down path enriches the shallow information, but the lack of high-level information is not compensated. To address these problems, we propose a feature enhancement feature pyramid. First, the channel dimension is reduced using convolutional groups, and the channel attention module (CAM) is introduced to improve feature utilization and enrich spatial information. Second, unpooling feature fusion (UPFF) is utilized to reduce the semantic differences between neighboring feature maps. Finally, we add residual branch connections to the high-level features to fully exploit the high-level information of the backbone network and compensate for the information loss of the feature map.
The main contributions of this article are as follows.
1) We propose a new feature enhancement feature pyramid network (FE-FPN) that effectively fuses high- and low-level features to improve target detection capability.
2) We introduce a convolutional combination to enrich spatial information, and add a CAM to optimize features at all the levels to improve feature representation.
3) We add residual branch connections to the high-level features to complement the high-level information.
4) We evaluate the FE-FPN and show that it improves detection accuracy over the traditional FPN.

A. Deep Object Detectors
The target detectors based on deep learning technology, which are capable of learning target features autonomously, can be classified into two major categories: 1) two-stage detectors, mainly R-CNN series; and 2) one-stage detectors, mainly including SSD and YOLO series. R-CNN [11] is the first to use a deep model to extract image features, unlike the classical algorithm that uses a sliding window to determine all possible regions in turn; it uses the selective search method to extract candidate frames, achieving an accuracy of 49.6% in the target detection task. Compared with R-CNN, Fast R-CNN [12] adds ROI pooling, which integrates feature extraction, classification, and regression into one step and improves the detection rate. Faster R-CNN [13] designs the region proposal network on the basis of Fast R-CNN, which replaces selective search and significantly improves detection speed, truly realizing end-to-end training. Cascade R-CNN [14] addresses the IoU selection problem of Faster R-CNN by cascading detectors to achieve optimized detection results. Several extended studies have been proposed to improve this framework's performance, such as CBNet [20] and DetNet [21].
One-stage detectors do not require a region proposal stage and have faster detection speeds than two-stage detectors. Redmon et al. [15] proposed YOLO, which merges the two phases of candidate frame and classification recognition and is an end-to-end network training model that enables real-time target detection with fast recognition and low background misclassification rate. YOLOv2 [16] made a series of improvements to YOLO, designed the Darknet backbone network, and introduced the idea of anchor box in Faster R-CNN, which enhanced the detection ability of small targets to some extent. YOLOv3 [17] introduced residual connection to propose Darknet-53 on the basis of Darknet-19 and made prediction at different scales, which achieved a better balance in speed and accuracy. YOLOv8 [18] adopts the CSPDarkNet structure, in which the C2f structure has richer gradient flow, and it replaces the head network with Decoupled Head, which separates the classification and detection head and improves the convergence speed, and it is an efficient algorithm including image classification, Anchor-Free object detection, and instance segmentation. YOLO-NAS [19] uses the self-developed AutoNAC neural architecture to search for a better architecture than YOLOv8, employs a multistage training approach, quantization perception module, and selective quantization to optimize performance, achieves the best accuracy-latency balance, and reaches new heights in target detection tasks. The detection performance of one-stage detectors has also been significantly improved by a number of proposals, such as RetinaNet [22] and FCOS [23].

B. Multiscale Feature Augmentation
In practice, the images to be detected often contain multiple kinds of targets with a large span of target scales. For large-scale targets, the detector needs strong semantic information as a basis for classification, while small-scale targets need finer-grained spatial information to achieve precise localization. As the model deepens, although the deep features have a large receptive field, the semantic information of the object to be detected gradually declines, which makes it difficult for the detection model to detect large and small targets simultaneously. Multiscale target detection has been extensively researched; the image pyramid [25] has been used in detection models to solve multiscale target detection problems. The network input is a randomly generated multiscale image that can generate multiscale feature expressions, and each feature map has strong semantic information, which alleviates the problem of large target scale changes to some extent. However, the network training process occupies a large amount of memory and inference takes a long time, which limits its applicability. In addition, different layers in a convolutional network can be used to improve detection. For example, the SSD [26] model uses a pyramidal feature hierarchy, in which feature maps of different sizes detect different objects. However, detecting small objects requires a feature map large enough to provide finer features and denser sampling, as well as enough semantic information to distinguish the objects from the background. Therefore, the poor performance of SSD on small targets is mainly due to the lack of sufficient semantic information in shallow large-scale feature maps.
Based on the above problems, Lin et al. [10] proposed a top-down network architecture as a solution, which they named FPN. This structure fuses adjacent feature maps, effectively addressing the large-scale variation of the objects to be detected, and the FPN has now become the base module of many detectors. Subsequently, a series of discussions and improvements were made to the structure of FPN. PANet [27] adds bottom-up paths to the FPN structure to further facilitate the flow of information between high- and low-level features. The BiFPN [28] proposes bidirectional cross-scale connectivity as well as a weighted feature fusion algorithm based on the structure of PANet, discarding single-scale connectivity and fusing richer features; subsequently, the NAS-FPN [29] automatically designs the network using a neural architecture search approach, further optimizing the design of the FPN for target detection, but its network training requires sufficiently powerful GPUs. The AC-FPN [30] proposes attention-guided modules to enrich contextual information using different receptive fields, and the model has better detection accuracy for larger input images.
In the abovementioned methods, the focus is on the network construction and feature fusion methods of the FPN. Since various fusion methods can improve the multiscale expression capability but, at the same time, introduce information redundancy, and since semantic information is also lost in the feature transmission process, our research focuses on compensating for the information loss and improving the efficiency of adjacent feature information fusion.

Fig. 3. Architecture of the FE-FPN. The CEM is proposed to enrich the feature map information. UPFF fuses adjacent feature maps according to the low-level coordinate information to improve the efficiency of information interaction between feature maps. The APSAM makes full use of high-level context information and merges it into M5 so that the feature maps can reduce the influence of information loss caused by the reduction of the number of channels and enhance the ability of feature representation.

A. Overall Structure
An overview of the framework named FE-FPN is presented in Fig. 3.
Our framework relies on ResNet50 [31] as its backbone. The backbone network contains a total of five stages, of which we select four stages, called C2, C3, C4, and C5, as the input to the pyramid network. C1 was not selected because its feature map size is large, which consumes a lot of memory during training, and its semantic information is insufficient to contribute much to the detection task. M2, M3, M4, and M5 are the features after the lateral connection, and then, P2, P3, P4, and P5 output by our pyramid network are used as the input of the next head network to complete the detection task.
In the feature maps output by the backbone network, the low-level features have richer details and a smaller receptive field, which is more suitable for detecting small targets, while the high-level features have low resolution and contain rich semantic information, which is more suitable for detecting large targets. Low resolution, blurred pictures, little information, and much noise are the difficulties in small target detection. Transmitting the strong high-level semantic information to the low level enriches the low-level semantic information while maintaining a large resolution, and improves the detection of small objects. In the FPN, feature maps at different scales can be produced by merging low- and high-level features to obtain rich information. However, after multiscale feature extraction, it is not determined which feature is more effective for detection, and the FPN directly fuses the features of each layer after the lateral connection. Moreover, feature fusion uses simple summation, ignoring the problem of feature conflict and misalignment in target detection. Therefore, in the research of multiscale target detection in RSIs, it is particularly important to make better use of the respective advantages of high- and low-level features.
Overall, our FE-FPN model is an improvement on the FPN model and consists of the channel enhancement module (CEM), the UPFF module, and the adaptive pooling spatial attention module (APSAM). We use the CEM to expand the receptive field of each feature map and adaptively adjust the weight of feature channels. Then, the extracted features are added to the top-down path, and the features are fused by the UPFF method. At the same time, residual branches are added to the high-level features to supplement the feature information, and finally, the output features are transmitted to the detection head for prediction.

B. Channel Enhancement Module
In the FPN, the lateral connection is a 1 × 1 convolutional layer, and the reduction of the number of feature channels at each level results in a partial loss of spatial information. Another issue of 1 × 1 convolution is that the receptive field is not large enough. The main methods used to increase the receptive field are adding pooling layers and enlarging the convolutional kernel, but pooling reduces the resolution and thus loses feature details, and enlarging the convolutional kernel [32], [33] introduces too many parameters, leading to overfitting of the model. By using two 3 × 3 convolutions instead of one 5 × 5 convolution, we are able to reduce the number of parameters caused by the increase of the convolution kernel. The overall structure of the CEM is visualized in Fig. 4. We choose a set of convolution blocks with kernel sizes of 1 × 1 and 3 × 3 at the lateral connection, where different convolutional kernel sizes provide different receptive fields, and then, we merge these features together so that small targets can be detected more easily. During this process, each layer of our features has the same number of channels to allow the fusion of subsequent features.
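The parameter saving from stacking two 3 × 3 convolutions in place of one 5 × 5 convolution (both cover a 5 × 5 receptive field) can be checked with simple arithmetic. The sketch below assumes equal input and output channel counts and no bias terms; the channel width of 256 is a hypothetical value chosen only for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution without bias."""
    return kernel * kernel * c_in * c_out

c = 256                                   # hypothetical channel width
two_3x3 = 2 * conv_params(3, c, c)        # 2 * 9 * C^2 = 18 C^2
one_5x5 = conv_params(5, c, c)            # 25 * C^2
ratio = two_3x3 / one_5x5                 # 18 / 25, independent of C
```

Under these assumptions, the stacked 3 × 3 pair needs 18/25 of the weights of the single 5 × 5 kernel.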
On the other hand, we introduce a CAM [34], [35] to correct the features and retain the valuable ones. As illustrated in Fig. 4, we first use global average pooling to compress the spatial information, and generate the weight vector by ReLU and Sigmoid functions. Subsequently, the important features of an image are emphasized by multiplying the weight vector by the original feature map, and finally, the result is combined with the input feature map to produce the output feature map, which makes the result more reliable.
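The channel attention step described above can be sketched as follows. This is a minimal sketch, not the exact CAM of Fig. 4: it assumes a squeeze-and-excitation style layout with two fully connected layers (weights `w1` and `w2`, a hypothetical reduction ratio of 4) between the pooling and the Sigmoid, and it assumes that "combined with the input feature map" means element-wise addition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C,H,W); w1: (C//r, C); w2: (C, C//r). Returns the output map and channel weights."""
    squeezed = x.mean(axis=(1, 2))               # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)      # ReLU
    weights = sigmoid(w2 @ hidden)               # per-channel weights in (0, 1)
    reweighted = x * weights[:, None, None]      # emphasize informative channels
    return x + reweighted, weights               # assumed residual combination with the input

# Hypothetical toy input: 16 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 16))
w2 = rng.standard_normal((16, 4))
out, weights = channel_attention(x, w1, w2)
```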

C. Unpooling Feature Fusion
Feature fusion refers to taking advantage of the complementarity between features to fuse the advantages between features to make the model perform better. In feature fusion, we have to consider the role of different features for the target task and the semantic gap between them, so that features with complementary information can be fully fused and features with the same role can be reduced to avoid information redundancy [36].
The resolution of the four feature maps output by the backbone network is halved from one level to the next, so passing the high-level semantic features to lower layers requires upsampling to reach the same resolution before fusion. The FPN adopts nearest neighbor upsampling in feature fusion, i.e., the gray value of the transformed pixel is made equal to the gray value of the nearest input pixel, and the high-level semantic features are fused with the lower level features by summation. It preserves the semantic information of the feature map to the greatest extent, with simple calculation and fast speed. However, the exact position correspondence among the feature maps of each layer is lost after passing through padding, convolution, and other layers, so it is difficult to exchange information effectively by simple summation and fusion. To this end, we introduce an upsampling method [37], which merges high-level features into low-level features using the location information of the low-level features.
As shown in Fig. 5, first of all, we observe that max-pooling can select features with better classification recognition, so we choose to do max-pooling on the low-level features to record the coordinate positions of the corresponding activation values. Then, this position information is passed to the high-level features as the parameters of unpooling upsampling to expand the features.
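The unpooling idea can be illustrated with a small NumPy sketch: the argmax position inside each 2 × 2 window of the low-level map is recorded, and each high-level value is then scattered to that position in a map of twice the resolution. The function names and the toy values are hypothetical, both maps are assumed to have the same number of channels, and the final element-wise summation is an assumption about how the unpooled map is fused.

```python
import numpy as np

def maxpool2x2_argmax(low):
    """For each 2x2 window of low (C,2H,2W), return the flat index (0..3) of its maximum."""
    c, h2, w2 = low.shape
    windows = low.reshape(c, h2 // 2, 2, w2 // 2, 2).transpose(0, 1, 3, 2, 4)
    return windows.reshape(c, h2 // 2, w2 // 2, 4).argmax(axis=-1)

def unpool2x2(high, idx):
    """Scatter high (C,H,W) into a zero map (C,2H,2W) at the positions recorded in idx."""
    c, h, w = high.shape
    out = np.zeros((c, h, w, 4), dtype=high.dtype)
    np.put_along_axis(out, idx[..., None], high[..., None], axis=-1)
    return out.reshape(c, h, w, 2, 2).transpose(0, 1, 3, 2, 4).reshape(c, 2 * h, 2 * w)

low = np.array([[[1., 5., 0., 2.],
                 [3., 2., 4., 1.],
                 [0., 0., 9., 8.],
                 [6., 1., 7., 3.]]])     # low-level map, shape (1,4,4)
high = np.array([[[10., 20.],
                  [30., 40.]]])          # high-level map, shape (1,2,2)
idx = maxpool2x2_argmax(low)             # positions of the strongest low-level activations
up = unpool2x2(high, idx)                # high-level values placed at those positions
fused = low + up                         # assumed fusion by element-wise summation
```

Unlike nearest neighbor upsampling, each high-level value lands on exactly one position per window, namely the one where the low-level feature responds most strongly.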

D. Adaptive Pooling Spatial Attention Module (APSAM)
The top-down path of the traditional FPN makes the semantic information of high-level features transmitted to bottom-level features, which makes the shallow feature map contain context information of different scales. However, top-level features have not received information from other features after convolution dimensionality reduction. Accordingly, we add a residual branch to extract the contextual information from different spaces in C5 to add to M5 to generate more powerful features.
As can be seen in Fig. 6, to integrate rich contextual information, we construct adaptive pooling pyramids [38] to obtain multiple contextual features at different scales and then stitch these feature maps in the channel dimension. Then, a spatial attention module [39], [40] is constructed so that each feature map output from the adaptive pooling pyramid passes through a 1 × 1 convolutional layer, a ReLU activation layer, a 3 × 3 convolutional layer, and a Sigmoid activation layer, in turn, to generate the corresponding spatial weights. M6 is endowed with multiscale context information by using the Hadamard product operation on the weights and feature maps. With the help of the APSAM, feature maps can be extracted at different scales to provide more contextual information and assist the interaction between features.
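The pooling and weighting steps can be sketched roughly as below. This is a simplified sketch under several assumptions rather than the module of Fig. 6: the bin sizes, channel counts, and random weights are hypothetical, the input size is assumed to be divisible by every bin size, pooled maps are restored to the input size by nearest neighbor repetition, and the weighted maps are aggregated by summation instead of reproducing the channel-wise stitching.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_avg_pool(x, bins):
    """Average-pool x (C,H,W) to (C,bins,bins); assumes H and W are divisible by bins."""
    c, h, w = x.shape
    return x.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))

def conv1x1(x, w):
    """x: (C_in,H,W), w: (C_out,C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    """x: (C_in,H,W), w: (C_out,C_in,3,3); zero padding keeps H and W unchanged."""
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3), axis=(1, 2))
    return np.einsum("ocij,chwij->ohw", w, windows)

def pooled_context(c5, bins_list, params):
    """Sum of pooled context maps, each scaled by its own spatial attention weights."""
    c, h, w = c5.shape
    context = np.zeros_like(c5)
    for bins, (w1, w3) in zip(bins_list, params):
        pooled = adaptive_avg_pool(c5, bins)
        pooled = pooled.repeat(h // bins, axis=1).repeat(w // bins, axis=2)  # back to (C,H,W)
        hidden = np.maximum(conv1x1(pooled, w1), 0.0)   # 1x1 convolution + ReLU
        weights = sigmoid(conv3x3(hidden, w3))          # 3x3 convolution + Sigmoid -> (1,H,W)
        context += weights * pooled                     # Hadamard product (assumed summation)
    return context

# Hypothetical toy input: 8 channels on a 12 x 12 grid, three pooling scales.
rng = np.random.default_rng(0)
c5 = rng.standard_normal((8, 12, 12))
bins_list = (1, 2, 3)
params = [(rng.standard_normal((4, 8)), rng.standard_normal((1, 4, 3, 3))) for _ in bins_list]
context = pooled_context(c5, bins_list, params)
```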

A. Dataset
We created the multiscale and multitarget remote sensing (MMRS) dataset based on the public DIOR dataset [24]. DIOR is an open-source RSI target detection dataset released by Northwestern Polytechnical University. The dataset covers different weather environments with large background variation, and it contains 20 categories and a total of 23 463 images with a size of 800 × 800. The objects of the dataset not only vary greatly in size between categories but also have a wide range of scale variation within the same category. Our experiments were conducted on the MMRS dataset, which contains 20 categories, 4690 training images, and 1172 validation images. Table I shows the number of targets of each type in our dataset, and examples are given in Fig. 7.

B. Implementation Details
All experiments were performed under the Windows 10 operating system on a machine equipped with an NVIDIA Tesla V100 GPU (32-GB memory). The code was implemented with Python 3.7.0 and PaddlePaddle 2.4.0. The models are trained for 100 epochs, with an initial learning rate of 0.01 that is decayed by a factor of 0.1 at the 80th and 90th epochs.
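The learning rate schedule can be written as a small step-decay function. This is a sketch under the assumption that "decreasing by 0.1" means multiplying the rate by 0.1 at each milestone and that the decay takes effect from the milestone epoch onward; the function name and the epoch indexing are hypothetical.

```python
import math

def learning_rate(epoch, base_lr=0.01, milestones=(80, 90), gamma=0.1):
    """Step decay: multiply base_lr by gamma once for each milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed

schedule = [learning_rate(e) for e in range(100)]   # one value per training epoch
```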
ResNet [31] is one of the most representative neural network models and one of the most widely used backbone networks at present. The residual connection structure alleviates the problems of vanishing and exploding gradients in deep networks and allows us to train deeper models. Moreover, the ResNet structure is simple and easy to understand and implement, and many variants based on ResNet can also help us develop new models faster. ResNet comes in five common sizes, namely ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. The networks differ mainly in the number and parameters of the blocks in the middle convolution part. As the networks deepen, more computing resources and longer training time are needed. According to the characteristics of our dataset, we choose ResNet50 as the backbone network, which uses 49 convolution layers and one fully connected layer and can better capture the details and features of images and provide higher accuracy in network training.

C. Main Results
In this section, we compare the performance of FE-FPN to that of baseline methods in Cascade R-CNN and Faster R-CNN. For quantitative comparison, we choose COCO-style average precision (AP) metrics as the standard [41]. AP is defined as the integral of the precision over the recall rate

$$\mathrm{AP} = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where P represents precision, which measures the percentage of positive predictions that are correct, and R represents recall rate, which measures the proportion of actual positive instances that are correctly detected. High precision indicates that the algorithm can correctly identify objects, while high recall means that it can detect most of the objects in the scene. IoU is defined as the overlap rate between the predicted bounding box and the ground truth bounding box, which can be used as a metric to evaluate the correctness of the bounding box. TP is the number of predicted bounding boxes with correct classification and correct bounding box coordinates. FP is the number of predicted bounding boxes whose IoU with the ground truth box is less than the threshold we set, i.e., the number of predictions with classification errors or inaccurate bounding box coordinates. FN is the number of ground truth boxes that are not predicted. By setting different thresholds, we can get different precision and recall values for each category. AP_50 and AP_75 represent the AP calculated under the IoU thresholds of 0.5 and 0.75, respectively.
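The quantities above can be computed with a few helper functions. This is an illustrative sketch, not the evaluation code used in the experiments: boxes are assumed to be given as (x1, y1, x2, y2), the function names are hypothetical, and the AP is approximated as the area under the precision envelope at the given recall points, whereas the COCO protocol samples precision at 101 fixed recall thresholds.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); returns 0.0 when a denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under P(R), using the monotonically decreasing precision envelope."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # envelope: max precision at recall >= r
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))         # intersection 1, union 7
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

A prediction counts as a TP at AP_50 when its class is correct and `iou` with a ground truth box is at least 0.5; raising the threshold to 0.75 gives the stricter AP_75.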
For multicategory detection, we choose mean average precision (mAP), which is the average of the AP over all categories. To assess multiscale detection capability, we define a small target as a target with a resolution of less than 32 pixels × 32 pixels, a medium target as a target with a resolution of more than 32 pixels × 32 pixels and less than 64 pixels × 64 pixels, and a large target as a target with a resolution of more than 64 pixels × 64 pixels. The corresponding detection accuracies are denoted by AP_S, AP_M, and AP_L, respectively. In addition to detection accuracy, another important performance index of the target detection algorithm is speed. Only fast speed can realize real-time detection, which is extremely important for some application scenarios. In this article, we use frames per second (FPS) as a metric to evaluate the speed of target detection. Table II shows all the results. Faster R-CNN with FE-FPN using ResNet50 as the backbone achieves 60.7 AP, which is 1.4% higher than that of the classic FPN. Besides, when using a more powerful model such as Cascade R-CNN with ResNet50, our model achieves 65.9 AP, which is 2.0% higher, and AP_S increases by 1.1%; the improvement in AP_L demonstrates that our model is considerably better at capturing information from large receptive fields.
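The scale split used for the small, medium, and large target metrics can be expressed as a simple rule. The sketch assumes that the thresholds apply to the bounding box area in pixels, as in COCO-style evaluation, and that a box lying exactly on a threshold falls into the larger bucket; both points are assumptions, and the function name is hypothetical.

```python
def size_bucket(width, height):
    """Assign a target to a scale bucket from its bounding box size in pixels."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 64 * 64:
        return "medium"
    return "large"

buckets = [size_bucket(w, h) for w, h in [(10, 10), (40, 40), (100, 100)]]
```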
Likewise, we compare the performance of the FE-FPN model proposed in this article with classical models such as FPN, PAFPN, HRFPN, BiFPN, and ACFPN, using the same environment configuration and the same detector and backbone network (Cascade R-CNN with ResNet50) to ensure the fairness of the comparative experiment. Because of the limitation of our hardware, a comparison with the previously mentioned NAS-FPN algorithm is missing. We believe that the above experiments are sufficient to demonstrate the effectiveness of our model on multiscale and multitarget images. The comparative results of multiscale target detection are shown in Table II, and we find that the FPN model is less effective at detection. Although the PAFPN, which uses path augmentation, improves the detection accuracy of large targets, it is limited in detecting targets of other scales. The HRFPN, which maintains a high-resolution representation, improves small target detection very little. The BiFPN, which uses weighted bidirectional fusion, brings some improvement in small target detection, but its improvement in large target detection is limited. The ACFPN, which integrates attention guidance, performs well on medium targets but poorly on targets of other scales. Although our model still has some limitations for medium target detection, overall, our model outperforms the other models and has advantages in large and small target detection. This is because our model uses multiscale receptive fields and an attention mechanism to get the most out of the feature information of each layer, and largely avoids the misalignment of adjacent features.
The visualization results are shown in Fig. 8. The proposed FE-FPN model is the most effective at detecting multiscale and multicategory objects from complex remote sensing backgrounds on the MMRS dataset. For dense small targets, the FPN model showed some missed detections [red circle, Fig. 8(b)] and false detections; for example, the small plane in a group of planes was falsely detected as a vehicle [white circle, Fig. 8(b)]. For images where large and small targets coexist, the FPN model does not detect small-scale targets, and for very large targets, the FPN is also prone to missed detections [red circle, Fig. 8(b)]. In addition, missed detections [red circle, Fig. 8] also occur in the PAFPN, HRFPN, BiFPN, and ACFPN models. There were fewer false and missed detections in the proposed FE-FPN model compared with the other models, and it detected large-size targets, dense small targets, and images of multiscale coexistence with higher accuracy than the other models [see Fig. 8(g)].
In addition, we compare the inference speed of each model, and the results in Table II show that the model compromises speed while improving detection accuracy. Although PAFPN has a high inference speed, its improvement of target detection accuracy is very limited, and our model mainly adds three modules to improve accuracy, each of which undoubtedly increases the computation of the network and makes the inference speed decrease. How to trade off the detection accuracy and inference speed of the model is the key research direction for future target detection models.

D. Ablation Study
In order to analyze the effects of the individual components in our proposed method, we conduct extensive ablation experiments in this section. All the ablation studies use Cascade R-CNN with ResNet50 as the baseline, and we gradually add the CEM, UPFF, and APSAM. Table III summarizes the ablation results of incrementally adding the components to the baseline model. According to the results, the baseline model provides a detection mAP of 63.9%. The CEM improves the baseline method by 0.9 AP; AP_50 increased by 0.3%, AP_75 increased by 1.0%, AP_S increased by 1.0%, and AP_L increased by 1.3%. The main reason is that, on the one hand, we extend the receptive field of the feature map, and, on the other hand, we improve the utilization of important features. The model with only the UPFF or APSAM module also improves the detection accuracy to a certain extent. When all three components are added, the model achieves 65.9 AP, a 2.0 AP improvement. Specifically, the improvement in AP_L (+2.6 AP) contributes most to the final improvement. It is also important to note that the model has reduced accuracy for medium target detection; as can be seen from Table III, when using the CEM, the medium target detection accuracy decreases by 1.2, and when using the UPFF module, the accuracy decreases by 1.6 while retaining a high inference speed. We consider that this is due to the high repetition rate of the information provided to the middle layer features; the duplicate information is not filtered out when the output features are extracted from the backbone network, so it affects the detection results, and we will filter this information in the next step. In general, these results indicate that providing more feature information is more advantageous for detecting large and small target objects.
To visually compare the effect of each module, we show the detection results of each module in Fig. 9. Compared with the ground truth [see Fig. 9(a)], there are some false detections in the model using only the CEM [white circle, Fig. 9(c)] and the model using only the UPFF module [white circle, Fig. 9(d)]. The model based solely on the APSAM has a certain number of false detections [red circle, Fig. 9(e)] and missed detections [white circle, Fig. 9(e)]. We believe this is because, when more information is passed in from higher level features, feature misalignment has a greater impact on the detection results during downward transmission due to the use of simple upsampling summation fusion. As shown in Fig. 9(d), the model using all three modules simultaneously achieves better results in detecting multiscale targets, which shows that our three proposed modules are complementary and contribute to the improvement of target detection accuracy in each category and at each scale.
Moreover, we also compare the effects of convolution selection and the attention mechanism in the CEM. In Table IV, the results show that convolution selection or the attention mechanism alone improves the detection performance by approximately the same amount, and their combination improves the detection by 0.9%.

V. DISCUSSION
The FPN is a classical neck network, but it shows defects in the multiscale object detection of RSIs. In view of this problem, our article aims to develop a neck network that is capable of realizing multiscale and multiclass detection. In comparison with several important CNN-based models (such as FPN, PAFPN, HRFPN, and BiFPN), experimental results indicate that our model has higher accuracy in detecting large and small targets and significantly reduces the number of false and missed detections. The FE-FPN model consists of the CEM, UPFF, and APSAM modules, and our experiments show that these three modules are complementary to each other. The CEM expands the receptive field and adjusts the channel weights, instead of simply using convolution downsampling. The UPFF upsampling fusion method proposed in this article is different from the current mainstream interpolation upsampling method. Our method performs the unpooling operation according to the position coordinate information of target features, which can align adjacent features with coordinate information before feature fusion and reduce feature misalignment. Also, in the process of continuous upsampling, information aliasing between features leads to poor detection of small- and medium-sized targets, which is a problem that we need to solve in the next step. The APSAM obtains multiscale information from the original feature map to enrich high-level feature information.
However, it is worth noting that the mean average precision (mAP) of small targets on the test set is only 16.7%, which is significantly lower than that of medium-sized and large targets. We believe that this may be caused by the difficulty of extracting features from small targets. In addition, our experiments are conducted only on the MMRS dataset, which is not fully representative; in particular, small target detection performance needs to be verified on dedicated small target datasets. Furthermore, the inference speed of the model has not been studied sufficiently. Moreover, the detection performance of our model on medium-sized targets declines somewhat, which we consider may be caused by feature redundancy in the upsampling process. Therefore, in our future research, first, we will design a feature selection module to address the degradation of detection accuracy caused by repeated feature information; second, we will validate the effectiveness of FE-FPN on different datasets, different networks, and even different multiscale vision tasks; finally, we will pay more attention to balancing the accuracy and speed of multiscale target detection.

VI. CONCLUSION
RSIs contain rich spatial structure information and geographic location information, and the processing and analysis of RSIs is currently an active research direction. However, RSIs have complex backgrounds, target sizes vary widely, and small targets are easily overlooked when large and small targets exist at the same time; these problems make target detection in RSIs extremely challenging. In this article, we analyzed the multiscale target detection model and the limitations of traditional FPNs. For the problem of information loss in convolutional dimensionality reduction, we first enriched the feature scale by adding convolutional blocks and then added the attention module to quickly obtain effective information and improve the representation ability of the feature network. For the problem of redundancy in adjacent feature fusion, we introduced UPFF to improve fusion efficiency. For the problem of the single scale of high-level features, the residual branch proposed in this article fully exploits C5 information to supplement the semantic information of the output feature layer. The empirical results verified that the proposed FE-FPN structure increases the ability of feature maps to extract information and to represent multiscale objects efficiently. In the future, we plan to explore a better multiscale feature pyramid detection model that addresses the current limitations. At the same time, experiments will be carried out on different datasets and with different backbone networks, and the model will be continuously improved to achieve a plug-and-play effect.