Introduction
In recent years, with the development of remote sensing technology, object detection has reached a new level [1]. Ocean target detection based on remote sensing images uses visual techniques to detect and monitor sea targets in satellite imagery, offering lower cost, less interference, and lower traceability than radar technology. It can capture small targets such as warships and fighter jets from long distances, and accurately detecting these small targets is crucial for national security. Optical remote sensing images can also capture maritime traffic facilities such as buoys and floats, as well as vessels that evade vessel traffic service (VTS) systems, which matters for safe navigation and maritime traffic management. In addition, optical remote sensing images can capture floating objects, plankton, and sediments, as well as minerals, petroleum, and other resources, with important implications for marine environmental monitoring and resource exploration. Small-object detection in optical remote sensing images is therefore significant across multiple fields, including military security, maritime traffic safety, environmental monitoring, and resource exploration.
Current remote sensing image detection methods are based on either traditional features or deep learning. Traditional object detection methods generally have three stages: first, a sliding window extracts candidate boxes from the target image; then features are extracted from these regions; finally, a trained classifier performs classification, as in HOG+SVM [2], Haar+AdaBoost [3], [4], and DPM [5]. However, because traditional methods rely on hand-designed features, they lack robustness to target diversity, suffer from slow detection speed and low accuracy, and cannot perform end-to-end detection. To address these issues, deep-learning-based object detection [6], [7], [8], [9] has developed rapidly and is mainly divided into two-stage and one-stage approaches [10], [11], [12]. Two-stage algorithms first generate region proposals and then use convolutional neural networks to classify the candidate regions; such networks [13], [14], [15] achieve high detection accuracy but slow inference, as in R-CNN [7], Fast R-CNN [10], Faster R-CNN [8], and Mask R-CNN [12]. One-stage algorithms skip the region proposal stage [8], [9], [10] and directly predict class probabilities and position coordinates, yielding faster detection; examples include the single shot multibox detector (SSD) [13], RetinaNet [9], and you only look once (YOLO) [15], [16], [17], [18], [19], [20]. Although deep learning has made significant progress in object detection, many shortcomings remain in small-target detection for ocean remote sensing. First, remote sensing data contain a large number of small targets with pixel sizes in the range [10, 50]. After feature extraction, the resolution of the feature map is reduced and small targets can disappear, making them difficult to locate and detect accurately. Second, the marine weather environment is complex and diverse, and remote sensing images contain considerable noise, such as rain and fog, which greatly interferes with the extraction of small-target features. Deep-learning-based networks generally show degraded performance on such data, which further challenges ocean remote sensing detection tasks.
The evaluation of deep learning algorithms depends on dataset support, and several marine remote sensing datasets are currently available. NWPU VHR-10 [21] is a publicly available optical remote sensing dataset containing 10 target categories. Li et al. [22] proposed DIOR, a publicly available optical remote sensing dataset containing 20 target categories. The HRSC2016 dataset [23] is a high-resolution marine remote sensing dataset containing only the ship category, sourced from six famous ports. However, these datasets either are not restricted to marine targets or cover only a single marine category, and none of them contain images captured under harsh marine weather conditions, so remote sensing detection networks achieve low accuracy on noisy marine targets. Therefore, we use algorithms to add marine weather noise to optical remote sensing images and construct the marine target detection dataset (MTDS), a remote sensing dataset containing six marine target categories.
To further enhance the detection accuracy of small targets, most existing methods [17], [18], [19], [20] increase the resolution of the input image, adopt data augmentation, perform feature fusion, or recalculate anchor sizes. Increasing the image resolution raises the computational complexity of the network and slows inference. Data augmentation improves recognition accuracy and robustness, but its gain for small targets is not significant. In-network feature fusion is a good way to improve small-target accuracy, as in FPN and FPN-based methods [24], [25], [26], but it still loses context information between the shallow and deep layers of the network and introduces noise that makes gradient computation inconsistent. BiFPN [27] proposes a weighted bidirectional feature pyramid network that introduces weights for features of different scales to better balance the feature information of targets of different sizes, but it adapts poorly to different datasets. Recalculating anchors adapts the network's candidate boxes to the dataset at hand; although it improves small-target detection, the calculation must be repeated for each dataset, which is cumbersome.
Besides poor small-target performance, remote sensing target detection faces two further problems: multiscale targets in complex backgrounds and detection speed. To address these, Ren et al. [8] abandoned the traditional sliding window [10] and selective search algorithms and directly used an RPN [8] to generate anchors at three scales. Compared with other two-stage object detection algorithms, this approach has faster detection speed and higher accuracy. However, the extracted feature map has low resolution, which is unfriendly to small-target detection, and the final fully connected layers produce a large number of parameters, so the network cannot meet real-time requirements. To meet speed and real-time requirements, SSD [13] adopted the one-stage idea of YOLO [15], [16], [17] to directly classify and regress bounding boxes, and borrowed the anchor mechanism from Faster R-CNN [8] to improve detection accuracy; fusing these two ideas improved both accuracy and speed. However, SSD produces fewer anchors on low-level features than RetinaNet, so it detects faster but less accurately. To further improve multiscale recognition, YOLOv3 [17] and RetinaNet [9] both use the feature pyramid method to fuse multiscale features and improve multiscale detection; the former produces fewer anchors than the latter, so it is faster but less accurate.
The introduction of YOLOV5 [20] brought simultaneous improvements in detection accuracy and speed. It adopts adaptive anchor calculation, an FPN+PAN structure [20], and the complete intersection over union (CIoU) loss function, which strengthen the network's feature extraction and fusion capabilities and improve detection accuracy. Various lightweight models based on YOLOV5 have also been proposed. Among them, PicoDet uses depthwise separable convolution and block stacking to significantly reduce the parameter count, making deployment on mobile devices easier, but it may increase the missed detection rate of small objects. To further improve detection performance, YOLOX [19] modifies the detection head by introducing a decoupled head structure and an anchor-free method, and incorporates multipositives and SimOTA to improve the filtering of proposal boxes. However, the decoupled head increases the network's computational complexity and slows down the model.
This article proposes a super feature pyramid network (HFPNet), shown in Fig. 1, which effectively improves feature extraction for small targets without significantly increasing network parameters, thus improving the accuracy of remote sensing object detection compared with other methods. The feature enhancement module (FEM) in the network splits the input features along the channel dimension and weights the subfeature sets separately in space and channel, enhancing feature extraction for small targets, weakening the noise interference of nontarget background, and further reducing the information lost as feature map resolution decreases. Since ocean remote sensing targets differ greatly in size, and to avoid detecting large targets well while detecting small targets poorly, the multiscale feature aggregation module (MFM) aggregates the enhanced features output by FEM across multiple scales, further reducing the probability of losing small targets. In addition, in the testing phase, this article uses the DIoU-NMS [28] postprocessing optimization method to improve the test accuracy of ocean remote sensing target detection.
HFPNet structure. HFPNet consists of four parts: input, backbone, neck, and head. The input part receives the environmentally enhanced image. The CBS module consists of convolution, batch normalization, and SiLU activation. The deep blue FEM and orange MFM are the feature enhancement and multiscale feature aggregation modules proposed in this article. The structures of the C3, CA, and SPPFCSPC modules are shown on the right-hand side of the figure.
Our contributions are as follows.
We proposed a novel HFPNet remote sensing object detection model, which includes the FEM and MFM modules. During inference, the network aggregates multiscale features based on the salient feature maps of the targets, improving the detection accuracy of remote sensing objects.
We addressed the problem of imbalanced positive and negative samples in remote sensing data by using focal loss to dynamically balance classification samples and adding a decay factor $\gamma \in [0,5]$ to dynamically balance the loss weights of easy and difficult samples. We also used EIoU as the regression loss to improve the detection accuracy of small remote sensing objects and speed up the convergence of the network.
We validated various improved NMS postprocessing methods and ultimately used the distance intersection over union (DIoU)-NMS method, achieving state-of-the-art performance in remote sensing object detection experiments with HFPNet.
We constructed a marine remote sensing target dataset (MTDS) and validated our model on the environment-enhanced MTDS and NWPU VHR-10 datasets, achieving the best detection performance on each.
Model Structure and Loss
In this section, we introduce the proposed method and the functionality and implementation details of its modules. The HFPNet model is a one-stage object detection model developed based on YOLOV5. The entire model consists of four parts: input, backbone, neck, and head, as shown in Fig. 1.
A. Structure of HFPNet
For the input, in addition to mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling, an algorithm simulates ocean weather conditions for the input remote sensing data, adding factors such as lighting, rain and fog, and blur to the training images to improve the model's robustness for ocean remote sensing target detection. For the backbone, we used CBS (Conv/BN/SiLU) and C3 structures for feature extraction and performed multiscale feature fusion in the last layer. In the neck, we designed the FEM and MFM modules to enhance the network's feature extraction on remote sensing data and its generalization to multiscale remote sensing targets. We also added a coordinate attention (CA) mechanism in front of the small-target detection head to focus on small targets.
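For concreteness, the following is a minimal PyTorch sketch of the CBS building block; it follows the Fig. 1 caption (convolution, batch normalization, SiLU) rather than any released code, and the kernel and stride values are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU block, per the Fig. 1 caption."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# e.g., one backbone stage that halves spatial resolution
x = torch.randn(1, 3, 640, 640)
print(CBS(3, 32, k=3, s=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```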
B. Structure of FEM
Research in computer vision has shown that adding channel attention and spatial attention can significantly improve a model. For example, SENet [29] learns the importance of each feature channel and uses it to enhance important features and suppress features that are unimportant to the current task. While channel information is important, the spatial structure cannot be ignored either. Therefore, the convolutional block attention module (CBAM) [30] combines channel and spatial attention: it weights important features and suppresses unimportant ones along the channel dimension while also considering the spatial structure, improving network performance without significantly increasing computation or parameters. Although CBAM [30] tries to introduce position information by global pooling on the channel, this captures only local information and cannot model long-range dependencies [31], [32].
The FEM structure proposed in this article enhances feature extraction from the input data, upweighting the pixels of ocean objects to be detected while downweighting background and nonobject pixels, thereby improving detection accuracy. The FEM structure is shown in Fig. 2.
FEM structure diagram. It adopts the method of channel segmentation to focus on the information of the channel branch and the spatial branch. The channel and spatial feature maps with added weight information are added to the input feature maps, and random feature maps are generated by shuffling the channels.
First, the feature information
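Based on the Fig. 2 description (channel split, separate channel and spatial weighting, residual addition, channel shuffle), a minimal PyTorch sketch of FEM might look as follows; the internals of the two branches are assumptions, since only the overall pipeline is specified above.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    """Sketch of the feature enhancement module: the input is split along the
    channel dimension, one subset is reweighted by a channel branch and the
    other by a spatial branch, the weighted maps are added back to the input,
    and the channels are shuffled (branch details are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.channel_gate = nn.Sequential(   # channel branch: GAP + 1x1 conv
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(half, half, 1),
            nn.Sigmoid())
        self.spatial_gate = nn.Sequential(   # spatial branch: per-pixel weight
            nn.GroupNorm(1, half),
            nn.Conv2d(half, 1, 1),
            nn.Sigmoid())

    def forward(self, x):
        xc, xs = x.chunk(2, dim=1)           # split channels into two subsets
        xc = xc * self.channel_gate(xc)      # channel-wise reweighting
        xs = xs * self.spatial_gate(xs)      # spatial reweighting
        out = torch.cat([xc, xs], dim=1) + x # add enhanced maps to the input
        b, c, h, w = out.shape               # channel shuffle across the two groups
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(FEM(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```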
C. Structure of MFM
MFM mainly adopts multiple dilated convolutions in parallel to obtain receptive fields of different scales and aggregates features of each scale through global weight information. This method can enhance the robustness of the network to multiscale objects in the ocean, thereby improving the accuracy of the model in remote sensing object detection. The MFM structure is shown in Fig. 3.
MFM structure. The CBR module refers to the concatenation of convolution, batch normalization, and ReLU activation layers.
Specifically, the features from the backbone and neck are concatenated and then passed through four dilated convolutional layers of different scales and a global average pooling layer. The purpose of multiscale atrous convolution [35] is to obtain receptive fields of different sizes over the input features, improving the multiscale detection ability for marine remote sensing objects. The purpose of global average pooling is to approximate the input features as closely as possible while obtaining a global receptive field. Subsequently, the output weight vector is reduced in dimensionality by an attenuation factor to reduce the computational load of the network, and outputs
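A minimal PyTorch sketch of MFM under the description above; the dilation rates, branch count, and reduction factor `r` (standing in for the attenuation factor) are assumptions, not the article's exact settings.

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Sketch of the multiscale feature aggregation module: four parallel
    dilated convolutions plus a global-average-pooling branch whose bottleneck
    (reduction r) produces global weights that aggregate the scales."""
    def __init__(self, channels, rates=(1, 2, 4, 8), r=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(                   # CBR: Conv + BN + ReLU
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
            for d in rates])
        self.gate = nn.Sequential(           # global weights, reduced by r
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels * len(rates), 1))

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, S, C, H, W)
        b, s, c, _, _ = feats.shape
        w = self.gate(x).view(b, s, c, 1, 1).softmax(dim=1)        # per-scale weights
        return (feats * w).sum(dim=1)                              # weighted aggregation

print(MFM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```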
D. HFPNet Loss Function
The loss of HFPNet is composed of three parts: 1) classification loss; 2) localization loss; and 3) confidence loss. The classification loss uses the focal loss function, the localization loss uses the EIoU loss [38], and BCEWithLogits loss is used as the confidence loss.
1) Classification Loss Function
Recently, the detection performance of the YOLOV5 network on various conventional datasets has reached an advanced level, but there is still room for improvement in the field of remote sensing. The YOLOV5 model treats classification as binary classification and uses BCEWithLogits loss as the classification loss function, as shown in (1) and (2)
\begin{align*}
\mathrm{P}=&\sigma (x)=\frac{1}{1+e^{-x}} \tag{1}
\\
\text{loss}=&-\frac{1}{n} \sum _{x}[y \ln P+(1-y) \ln (1-P)] \tag{2}
\end{align*}
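As a quick sanity check of (1) and (2), PyTorch's built-in `binary_cross_entropy_with_logits` reproduces the manual computation (a small sketch; the article's training code is not shown).

```python
import torch
import torch.nn.functional as F

x = torch.randn(8)                      # raw logits
y = torch.randint(0, 2, (8,)).float()   # binary labels

p = torch.sigmoid(x)                    # P in (1)
manual = -(y * p.log() + (1 - y) * (1 - p).log()).mean()  # (2)
builtin = F.binary_cross_entropy_with_logits(x, y)
print(torch.allclose(manual, builtin))  # True
```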
Although the BCEWithLogits loss converges quickly, it performs poorly at weighting positive against negative samples and at weighting easy samples (where the predicted value is close to the true value) against difficult samples (where the difference is large). In one-stage object detection, an image may generate thousands of anchor boxes, but typically only a dozen or twenty anchor boxes (positive samples) match an object; the rest are designated negative samples.
Each negative sample produces only a small loss, but their summed loss can still heavily disrupt the training weights of the positive samples. In response to these problems, this article introduces the focal loss function [9], which dynamically balances the uneven proportion of positive and negative samples during training through a dynamic scaling factor $\alpha_t$, as shown in (3)–(5)
\begin{align*}
\text{CE}\left(p_{t}\right)=&-\log \left(p_{t}\right) \tag{3}
\\
\alpha _{t}=&\begin{cases}\alpha, & \text{if } y=1 \\
1-\alpha, & \text{otherwise}\end{cases} \tag{4}
\\
\text{FL}\left(p_{t}\right)=&-\alpha _{t} \log \left(p_{t}\right) \tag{5}
\end{align*}
In addition, simple negative samples make a major contribution to the network's loss and dominate the gradient update direction, making it impossible to classify objects accurately. Therefore, the focal loss function adds a modulating factor $(1-p_{t})^{\gamma }$ that down-weights easy samples, yielding (6)
\begin{equation*}
\text{FL}\left(p_{t}\right)=-\alpha _{t}\left(1-p_{t}\right)^\gamma \log \left(p_{t}\right). \tag{6}
\end{equation*}
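A compact reference implementation of (3)–(6) in PyTorch (a sketch, not the article's training code; `gamma=2.0` is one value within the stated range $[0,5]$).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss following (3)-(6): alpha_t balances positive and
    negative samples, (1 - p_t)^gamma down-weights easy samples."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = targets * p + (1 - targets) * (1 - p)              # p_t as in (3)
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # (4)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # (6); ce = -log(p_t)

logits = torch.randn(16)
targets = torch.randint(0, 2, (16,)).float()
print(focal_loss(logits, targets))
```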
2) Localization Loss Function
The generalized intersection over union (GIoU) localization loss [36] solves the problem of a constant loss when the traditional intersection over union (IoU) equals 0 or when multiple anchor boxes have the same IoU [37] value, thereby accelerating loss convergence, as shown in the second column of Fig. 4. However, when the ground-truth box contains the predicted box, GIoU degenerates into the original IoU, leaving the relative position problem unsolved. DIoU loss [28] adds relative position information on the basis of GIoU and judges the loss according to distance. However, as the third column of Fig. 4 shows, the DIoU penalty is inversely proportional to the diagonal distance $L$ of the smallest enclosing box
Although CIoU [28] adds the aspect ratio consistency term $V$ to the DIoU penalty, as shown in (7) and (8), it describes only the relative difference between aspect ratios rather than the true differences in width and height
\begin{align*}
V&=\frac{4}{\pi ^{2}}\left(\arctan \frac{w^{g t}}{h^{g t}}-\arctan \frac{w^{p}}{h^{p}}\right) \tag{7}
\\
R_{\text{CIoU}}&=\frac{d^{2}}{L^{2}}+\frac{V^{2}}{(1-\text{IoU})+V}. \tag{8}
\end{align*}
Among them, $d$ is the distance between the center points of the predicted and ground-truth boxes, $L$ is the diagonal length of the smallest enclosing box, and $w^{gt}$, $h^{gt}$ and $w^{p}$, $h^{p}$ are the widths and heights of the ground-truth and predicted boxes, respectively.
To further improve the accuracy of maritime remote sensing object detection, we adopt the EIoU loss [38], as shown in the fifth column of Fig. 4, which considers the overlapping area between the predicted and ground-truth anchors, the center point distance, and the width and height losses. The width and height losses directly minimize the differences between the predicted and ground-truth widths and heights, making convergence faster and regression more accurate. The loss function is shown in (9)
\begin{equation*}
L_{\text{EIoU}}=1-\text{IoU}+\frac{d^{2}}{L^{2}}+\frac{\rho ^{2}\left(w, w^{g t}\right)}{C_{w}^{2}}+\frac{\rho ^{2}\left(h, h^{g t}\right)}{C_{h}^{2}}. \tag{9}
\end{equation*}
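For reference, (9) can be implemented for corner-format boxes as follows; this is a sketch of the EIoU loss [38] under the definitions above, not the article's exact code.

```python
import torch

def eiou_loss(pred, gt, eps=1e-7):
    """EIoU loss as in (9) for boxes given as (x1, y1, x2, y2)."""
    # intersection and union for the IoU term
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # smallest enclosing box: width C_w, height C_h, squared diagonal L^2
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    l2 = cw ** 2 + ch ** 2 + eps
    # squared center distance d^2
    d2 = ((pred[:, 0] + pred[:, 2]) - (gt[:, 0] + gt[:, 2])) ** 2 / 4 \
       + ((pred[:, 1] + pred[:, 3]) - (gt[:, 1] + gt[:, 3])) ** 2 / 4
    # width/height penalty terms from (9)
    return (1 - iou + d2 / l2 + (wp - wg) ** 2 / (cw ** 2 + eps)
            + (hp - hg) ** 2 / (ch ** 2 + eps)).mean()

pred = torch.tensor([[10., 10., 50., 60.]])
gt = torch.tensor([[12., 8., 48., 62.]])
print(eiou_loss(pred, gt))  # scalar loss
```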
Experiments and Results
A. Experimental Environment and Data Preparation
This experiment is based on the PyTorch deep learning framework and was carried out in a GPU environment for both training and testing. The specific environment configuration is shown in Table I.
In terms of datasets, this article used the publicly available remote sensing dataset NWPU VHR-10, which contains 800 remote sensing images covering 10 object categories, including ships, ports, and bridges. In addition, we constructed the MTDS dataset, which also contains 800 remote sensing images. MTDS consists exclusively of marine objects: ships, ports, bridges, dams, islands, and wind turbines, with 240, 130, 80, 80, 130, and 140 images, respectively. The image data are sourced from Google Earth. About 40% of the remote sensing images from both datasets were randomly subjected to environmental enhancement, resulting in a dataset of 2400 marine remote sensing images, which was then divided into training, validation, and test sets at a ratio of 7:2:1.
To simulate the complex environmental conditions often encountered at sea, this article algorithmically added blur, lighting intensity, and rain and fog factors to the dataset to make the model more robust. Specifically, OpenCV was used to perform the environmental enhancement, controlling noise levels such as lighting with uniform random numbers and thresholds. Random noise, filters, and other methods were used to add rain and fog effects, while Gaussian motion blur was used to simulate realistic motion blur.
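The exact augmentation parameters are not specified, but a plausible OpenCV sketch of the lighting, fog, rain, and motion-blur operations is shown below; all thresholds, noise models, and the input path are assumptions for illustration.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def enhance_environment(img):
    """Sketch of the marine-weather augmentation described above."""
    out = img.astype(np.float32)
    # lighting: uniform random gain (brightens or darkens the image)
    out *= rng.uniform(0.5, 1.5)
    # fog: blend with a white layer of random density
    fog = rng.uniform(0.0, 0.4)
    out = out * (1 - fog) + 255.0 * fog
    # rain: sparse bright pixels from random noise, smeared vertically
    rain = (rng.random(img.shape[:2]) > 0.998).astype(np.float32) * 255.0
    rain = cv2.blur(rain, (1, 9))
    out += rain[..., None]
    # motion blur: convolve with a 1-D horizontal kernel
    # (the direction could also be randomized)
    k = np.zeros((9, 9), np.float32)
    k[4, :] = 1.0 / 9.0
    out = cv2.filter2D(out, -1, k)
    return np.clip(out, 0, 255).astype(np.uint8)

img = cv2.imread("remote_sensing_tile.jpg")  # hypothetical input path
if img is not None:
    cv2.imwrite("remote_sensing_tile_aug.jpg", enhance_environment(img))
```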
To demonstrate the multiscale property of the MTDS dataset, this study compared it with PASCAL VOC2007 and SSDD in terms of candidate box proportion distribution. As Fig. 5 shows, the ratio of target length to image size in MTDS is concentrated in the range 0.02–0.48, much smaller than the 0.18–0.70 range of PASCAL VOC [39] objects but larger than the 0.02–0.22 range of SSDD. MTDS thus captures the multiscale characteristics of remote sensing objects and is not limited to small objects such as ships, enabling the network to accurately identify remote sensing objects of different scales.
Distribution of candidate anchor proportions relative to image size in each dataset. There are 15 662 candidate anchors in the VOC2007 dataset, 2456 in SSDD, and 8400 in MTDS.
B. Evaluation Criterion
In this article, we use precision (P), recall (R), average precision (AP), mean average precision (mAP), network parameters (Params), frames per second (FPS), and computational cost in GFLOPs (giga floating-point operations) as evaluation indicators. The formulas for the evaluation metrics are as follows:
\begin{align*}
P_{d}(\text{ Precision})&=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{10}
\\
R_{d}(\text{ Recall})&=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{11}
\end{align*}
\begin{align*}
\text{AP}_{d}&=\int _{0}^{1} P_{d}\left(R_{d}\right) d R_{d} \tag{12}
\\
\text{mAP}_{d}&=\frac{1}{N_{c}} \sum _{i=1}^{N_{c}} (\text{AP}_{i})_{d} \tag{13}
\\
\text{FPS}&=\frac{1}{\text{ Time }} \tag{14}
\end{align*}
AP is computed from the precision–recall curve and measures the performance of an object detection algorithm. mAP is the average AP over all categories and evaluates the trained model's detection ability across categories; its calculation is given in (13). In this article,
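A minimal NumPy illustration of (10)–(12) with all-point interpolation follows (a sketch, not the article's evaluator); mAP then averages the per-class APs as in (13).

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP as the area under the precision-recall curve, per (10)-(12)."""
    order = np.argsort(-scores)                           # rank detections by confidence
    tp = np.cumsum(is_tp[order].astype(float))
    fp = np.cumsum((~is_tp[order]).astype(float))
    recall = np.concatenate(([0.0], tp / n_gt))           # (11), with a (0, 1) sentinel
    precision = np.concatenate(([1.0], tp / (tp + fp)))   # (10)
    precision = np.maximum.accumulate(precision[::-1])[::-1]  # monotone envelope
    return np.trapz(precision, recall)                    # (12)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
is_tp = np.array([True, True, False, True, False])
print(average_precision(scores, is_tp, n_gt=4))           # 0.6875
```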
C. Marine Environment Enhancement to Data
After algorithmically adding complex weather noise to the ocean remote sensing images, the data cover the various situations found in actual scenes, as shown in Fig. 6. To verify that environment enhancement improves model accuracy, we used MTDS and NWPU VHR-10 to test the base model YOLOV5s (whose backbone is CSPDarkNet53) and HFPNet; the experimental results on the test set are shown in Table II, where
Comparison of data for simulating the complex weather environment at sea. (a) Original image. (b) Darkened image. (c) Brightened image. (d) Fogged image. (e) Blurred image. (f) Rain image.
From the experimental data, it can be seen that the networks with added environmental noise have improved testing accuracy after training. YOLOV5s(base) obtained 93.26% and 91.26%
However, the HFPNet in this article obtained 92.94% and 90.88%
D. Effect of Adding FEM on Network
As the network deepens, after a series of convolution and pooling operations, the resolution of the feature map gradually decreases. Although the network extracts rich deep features, it loses some shallow information, making small objects in the marine environment easy to miss or misdetect. In response, we embed the FEM module into different layers of the network to fully capture multiscale context, which alleviates, to a certain extent, the information loss caused by channel reduction and convolution-pooling operations.
According to the data in Table III, different effects can be achieved by embedding FEM in different feature extraction layers of the network. YOLOV5s(base) and HFPNet achieved the highest
Comparison chart of test results after integrating FEM. Row a refers to YOLOV3, row b refers to YOLOV5-Mobile, row c refers to YOLOX, row d refers to PicoDet, row e refers to YOLOV5, and row f refers to HFPNet. The green arrow points to missed targets while the blue arrow points to incorrectly detected targets.
E. Impact of FEM and MFM on the Network
The results in Table III demonstrate the accuracy contribution of the proposed FEM, but the network still converges weakly and detects multiscale objects poorly. The designed MFM not only accelerates convergence but also enhances multiscale detection. We conducted ablation experiments comparing our model with YOLOV5s(base) on the VHR-10-A and MTDS-A datasets, as shown in Tables IV and V. The base network with FEM and MFM in Table IV outperforms the base network without the two modules, and the accuracy on
For object detection networks, the number of model parameters cannot be ignored. As the model becomes more complex, its parameter count and computational cost increase accordingly. To reduce parameters and improve detection speed, methods such as model compression and pruning have been proposed, but these often harm small-object recognition. Therefore, this article proposes HFPNet, which combines FEM and MFM, and compares it with other advanced object detection models on the MTDS-A and VHR-10 datasets, as shown in Tables VI and VII. HFPNet achieves the best performance in both detection accuracy and speed, demonstrating outstanding performance in marine remote sensing detection tasks. The loss and
Comparison of
Comparison chart of loss function during training and validation of different models.
Comparison chart of detection results for different networks on VHR-10-A and MTDS-A test set. The images in columns a, b, c, and d are from the VHR-10-A dataset while the images in columns e, f, and g are from the MTDS-A dataset.
F. Influence of Attention Mechanism on Ocean Remote Sensing Detection
According to the above comparisons, HFPNet delivers the best performance in marine remote sensing object detection. Most studies show that embedding an attention mechanism [1], [40], [41], [42], [43] can generally improve network performance, but it can also reduce precision by disturbing the network's feature weights. To explore the influence of attention mechanisms on HFPNet, we conducted ablation experiments, with results shown in Table VIII.
We found that different attention modules have little impact on HFPNet's parameters and inference time, and their computational overhead is similar and almost negligible on a GPU, but their impact on accuracy is significant. Embedding the SE and CBAM modules degrades model performance, while embedding the CA module improves performance by 0.24% and 0.18% in
Performance comparison of object detection models. The results in (a) were obtained on the VHR-10-A test set, while the results in (b) were obtained on the MTDS-A test set. The closer a mark is to the upper right corner, the better the performance.
Comparison of heatmaps under different scenarios. The first five rows of images are from VHR-10-A test set, and the last three rows are from MTDS-A test set. Column A shows the input image after environmental enhancement. Column B displays the heatmap generated by the model without the FEM and MFM modules. Column C presents the heatmap obtained by the model after the dual effects of FEM and MFM. Column D shows the heatmap obtained by embedding the CA attention mechanism in HFPNet.
G. Strengthen Network Post-Processing
The classic NMS algorithm selects the bounding box [8] with the highest confidence among all anchors as the reference and removes all anchors whose IoU with the reference exceeds the NMS threshold [20]. It can be seen that IoU is the only factor considered. In practice, however, when two different objects are close together, their IoU is relatively large, so only one bounding box survives NMS, causing missed detections. In response, we improve NMS by using DIoU as the suppression criterion. DIoU-NMS considers not only the IoU but also the distance between the center points of two anchors: if the IoU between two anchors is relatively large but their center distance is also large, the network treats them as anchors of two different objects, reducing the probability of missing small maritime remote sensing objects. This article tests HFPNet with the DIoU-NMS method on the MTDS-A validation set, where it shows a clear improvement on dense small objects; the comparison experiment is shown in Fig. 13. Finally, using DIoU-NMS improves HFPNet's mAP on the test set by 0.34%.
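A sketch of DIoU-NMS consistent with the description above, for corner-format boxes; the threshold value and loop structure are illustrative assumptions, not the article's exact implementation.

```python
import torch

def diou_nms(boxes, scores, iou_thr=0.5):
    """DIoU-NMS sketch: a box is suppressed only if IoU minus the normalized
    center-distance penalty exceeds the threshold, so overlapping boxes with
    distant centers (likely two dense small objects) are both kept."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest, b = boxes[order[1:]], boxes[i]
        # IoU between the reference box and the remaining boxes
        x1 = torch.max(b[0], rest[:, 0]); y1 = torch.max(b[1], rest[:, 1])
        x2 = torch.min(b[2], rest[:, 2]); y2 = torch.min(b[3], rest[:, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_b + area_r - inter)
        # center-distance penalty d^2 / L^2 over the enclosing-box diagonal
        d2 = ((b[0] + b[2]) - (rest[:, 0] + rest[:, 2])) ** 2 / 4 \
           + ((b[1] + b[3]) - (rest[:, 1] + rest[:, 3])) ** 2 / 4
        cw = torch.max(b[2], rest[:, 2]) - torch.min(b[0], rest[:, 0])
        ch = torch.max(b[3], rest[:, 3]) - torch.min(b[1], rest[:, 1])
        diou = iou - d2 / (cw ** 2 + ch ** 2 + 1e-7)
        order = order[1:][diou <= iou_thr]  # keep boxes below the DIoU threshold
    return torch.tensor(keep)

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 0., 30., 10.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(diou_nms(boxes, scores))  # tensor([0, 2])
```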
HFPNet with the DIoU-NMS method. The first row shows the original images, the second row the detection results of HFPNet with standard NMS, and the third row the results of HFPNet with DIoU-NMS. The green arrows indicate the locations of missed detection targets.
Conclusion
This article proposes a super feature pyramid network (HFPNet) to address missed and false detections in marine remote sensing object detection tasks. The FEM in the network generates spatial and channel attention coefficients; during training, the network enhances the weight information of objects based on these coefficients while reducing the influence of noise, significantly improving the detection accuracy of marine remote sensing objects. For the multiscale nature of marine objects, the MFM integrates feature information from different levels of the network, improving multiscale detection. In addition, to address imbalanced positive and negative samples in remote sensing datasets, we redesigned the network's loss function. Owing to the lack of datasets containing only marine targets, we constructed the marine remote sensing dataset MTDS and enhanced both it and the public NWPU VHR-10 dataset using our proposed method, resulting in improvements in
We further explored the impact of attention mechanisms on the model and conducted ablation experiments on the attention mechanisms in the network, resulting in an improvement of 0.24% and 0.18% in