
HFPNet: Super Feature Aggregation Pyramid Network for Maritime Remote Sensing Small-Object Detection


Abstract:

In recent years, with the development of the maritime industry, computer vision tasks based on optical remote sensing images have gained increasing attention in the field of marine remote sensing. However, in complex marine meteorological environments, traditional detection methods often suffer from high rates of missed detections and false alarms for small targets. To address this issue, this article proposes a super feature pyramid network (HFPNet) for detecting marine remote sensing objects, which includes a feature enhancement module (FEM) and a multiscale feature aggregation module (MFM). The FEM highlights the features of small targets and suppresses complex background features during detection, while the MFM shares high-level and low-level features to better fuse multiscale feature information. Additionally, due to the lack of marine remote sensing object datasets, this article constructs a marine target detection dataset (MTDS) containing six types of marine objects. To address the issue of imbalanced positive and negative samples in the dataset, this article designs a network loss function that accelerates network convergence and improves the accuracy of small-target detection. Compared with other models, HFPNet achieved the highest \text{mAP}_{0.5} of 97.48% and 95.46% on the self-built dataset MTDS and the public dataset NWPU-VHR-10, respectively, after environmental enhancement. At the same time, it also achieved the fastest frames per second (FPS) on the test set. Finally, this article discusses the influence of attention mechanisms and postprocessing methods on HFPNet and obtains the best small-target detection model.
Page(s): 5973 - 5989
Date of Publication: 15 June 2023

SECTION I.

Introduction

In recent years, with the development of remote sensing technology, object detection has reached a new level [1]. Detecting ocean targets from remote sensing images uses visual techniques to detect and monitor sea targets in satellite imagery, offering lower cost, less interference, and lower traceability compared to radar technology. Such technology can capture small targets, such as warships and fighter jets, from a long distance, and accurately detecting these small targets is crucial for ensuring national security. Optical remote sensing images can capture maritime traffic facilities such as buoys and floats, as well as vessels that evade vessel traffic service (VTS) systems, which is important for safe navigation and maritime traffic management. In addition, optical remote sensing images can also capture floating objects, plankton, and sediments, as well as minerals, petroleum, and other resources, which has important implications for marine environmental monitoring and resource exploration. Therefore, small-object detection in optical remote sensing images is crucial, as it involves multiple fields, such as military security, maritime traffic safety, environmental monitoring, and resource exploration.

Currently, remote sensing image detection methods are based on either traditional features or deep learning. Traditional object detection methods generally involve three stages: first, using the sliding-window method to extract candidate boxes from the target image; then extracting features from these regions; and finally, classifying them with a trained classifier, as in HOG+SVM [2], Haar+AdaBoost [3], [4], and DPM [5]. However, because their features are hand-designed, traditional methods lack robustness to the diversity of targets, resulting in slow detection and low accuracy, and they cannot perform end-to-end detection. To address these issues, deep-learning-based object detection [6], [7], [8], [9] has developed rapidly and is mainly divided into two-stage and one-stage approaches [10], [11], [12]. Two-stage algorithms first generate region proposals and then use convolutional neural networks to classify the candidate regions; therefore, two-stage networks [13], [14], [15] have high detection accuracy but slow inference speed, such as R-CNN [7], Fast R-CNN [10], Faster R-CNN [8], and Mask R-CNN [12]. One-stage algorithms do not require a region proposal stage [8], [9], [10] and directly predict the object's class probability and position coordinates, thus achieving faster detection. Examples include the single shot multibox detector (SSD) [13], RetinaNet [9], and you only look once (YOLO) [15], [16], [17], [18], [19], [20]. Although deep learning has made significant progress in object detection, many shortcomings still need to be addressed in small-target detection for ocean remote sensing. First, remote sensing data contain a large number of small targets whose pixel sizes lie in the range [10, 50]. After feature extraction by the network, the resolution of the feature map is reduced, causing small targets to vanish and making them difficult to accurately locate and detect. Second, the marine weather environment is complex and diverse, and remote sensing images contain substantial noise, such as rain and fog, which greatly interferes with the extraction of small-target features by the network. Deep-learning-based networks generally show a decline in performance when processing such remote sensing data, which further challenges ocean remote sensing detection tasks.

The evaluation of deep learning algorithms cannot be separated from the support of datasets, and many marine remote sensing datasets are currently available. NWPU VHR-10 [21] is a publicly available optical remote sensing dataset containing 10 types of remote sensing targets. Li et al. [22] proposed the DIOR dataset, a publicly available optical remote sensing dataset containing 20 types of remote sensing targets. The HRSC2016 dataset [23] is a high-resolution marine remote sensing dataset containing only the ship category, sourced from six famous ports. However, these datasets either are not limited to marine targets or contain only a single type of marine target, and none of them include images captured under harsh marine weather conditions, so remote sensing detection networks achieve low accuracy on noisy marine targets. Therefore, we algorithmically add marine-weather noise to optical remote sensing images and construct the marine target detection dataset (MTDS), a remote sensing dataset containing six marine target categories.

To further enhance the detection accuracy of small targets, most existing methods [17], [18], [19], [20] increase the resolution of the input image, adopt data augmentation, perform feature fusion, or recalculate anchor sizes. Increasing the image resolution often increases the computational complexity of the network and slows down inference. Data augmentation can improve the recognition accuracy and robustness of the network, but the improvement for small targets is not significant. Feature fusion within the network is a good way to improve small-target recognition, as in FPN or FPN-based methods [24], [25], [26], but such methods still lose context information from the shallow to the deep layers of the network and introduce noise that interferes with gradient calculation, causing inconsistency in the computed gradients. BiFPN [27] proposes a weighted bidirectional feature pyramid network that introduces weights for features at different scales to better balance the scale information of targets of different sizes, but it adapts poorly to different datasets. Finally, recalculating anchors adapts the network's candidate boxes to the dataset and must be redone for each dataset; although it improves small-target detection, the process is cumbersome.

In addition to poor performance in detecting small targets, remote sensing target detection faces two further problems: multiscale targets in complex backgrounds and detection speed. To address these problems, Ren et al. [8] abandoned the traditional sliding-window [10] and selective search algorithms and directly used an RPN [8] to generate anchors at three different scales for detecting targets. Compared with other two-stage object detection algorithms, this approach has faster detection speed and higher accuracy. However, the feature map extracted by this method has low resolution and is unfriendly to small-target detection; moreover, the network ends with fully connected layers, which produce a large number of parameters and cannot meet real-time requirements. To meet the requirements of speed and real-time performance, SSD [13] adopted the one-stage idea of YOLO [15], [16], [17] to directly classify and regress bounding boxes, and also borrowed the anchor mechanism from Faster R-CNN [8] to improve detection accuracy; the fusion of these two ideas improved both accuracy and speed. However, SSD produces fewer anchors on its low-level features than RetinaNet, so it has a faster detection speed but lower accuracy. To further improve the recognition accuracy of multiscale targets, YOLOv3 [17] and RetinaNet [9] both use the feature pyramid method to fuse multiscale features and improve the detection of multiscale targets. However, the former produces fewer anchors than the latter, so it has a faster detection speed but lower accuracy.

The introduction of YOLOV5 [20] has led to simultaneous improvements in detection accuracy and speed. It adopts operations such as adaptive anchor calculation, FPN+PAN structure [20], and complete intersection over union (CIOU) loss function, which continuously strengthen the network's feature extraction and fusion capabilities, resulting in improved detection accuracy. At the same time, various lightweight models based on YOLOV5 have been proposed. Among them, PicoDet uses depthwise separable convolution and block stacking to significantly reduce the network's parameter count, making it easier to deploy on mobile devices, but it may increase the missed detection rate of small objects. In order to further improve the detection performance, YOLOX [19] modifies the detection head of the network by introducing the decoupled head structure and anchor-free method, and incorporates methods such as multipositives and SimOTA to improve the filtering of proposal boxes. However, introducing the decoupled head structure increases the network's computational complexity and slows down the model's running speed.

This article proposes a super feature pyramid network (HFPNet), shown in Fig. 1, which effectively improves the feature extraction ability for small targets without significantly increasing the number of network parameters, thus improving the accuracy of remote sensing object detection compared to other methods. The feature enhancement module (FEM) in the network segments the input features in the channel dimension and weights the subfeature sets separately in space and channel to enhance feature extraction for small targets and weaken the noise interference of the nontarget background, further reducing the information lost when the feature map resolution is reduced. Since the sizes of ocean remote sensing targets differ significantly, to avoid detecting large targets well while detecting small targets poorly, the multiscale feature aggregation module (MFM) aggregates the enhanced features output by the FEM at multiple scales, further reducing the probability of losing small targets. In addition, in the testing phase, this article uses the DIoU-NMS [28] postprocessing optimization method to improve the test accuracy of ocean remote sensing target detection.

Fig. 1. HFPNet structure. HFPNet consists of four parts: input, backbone, neck, and head. The input part takes the image enhanced by environmental enhancement. The CBS module consists of convolution, batch normalization, and the SiLU activation function. The deep blue FEM and orange MFM are the feature enhancement and multiscale feature fusion modules proposed in this article. The structures of the C3, CA, and SPPFCSPC modules are shown on the right-hand side of the figure.

Our contributions are as follows.

  1. We proposed a novel HFPNet remote sensing object detection model, which includes FEM and MFM modules. In inference, the network can aggregate multiscale features based on the salient feature maps of the targets, improving the detection accuracy of remote sensing objects.

  2. We addressed the problem of imbalanced positive and negative samples in remote sensing data by using the focal loss to dynamically balance classification samples and adding a decay factor \gamma \in [0,5] to dynamically balance the loss weights of easy and difficult samples. We also used the EIoU loss as the regression loss to improve the detection accuracy of small remote sensing objects and speed up the convergence of the network.

  3. We validated various improved NMS postprocessing methods and ultimately used the distance intersection over union (DIoU)-NMS method to achieve state-of-the-art performance in remote sensing object detection experiments with HFPNet.

  4. We constructed a dataset for remote sensing targets (MTDS) and validated our model on the environment-enhanced MTDS and NWPU-VHR-10 datasets, achieving the best detection performance on each.

SECTION II.

Model Structure and Loss

In this section, we introduce the proposed method and the functionalities and implementation details of its modules. The HFPNet model is a one-stage object detection model developed based on YOLOV5. The entire model consists of four parts: input, backbone, neck, and head, as shown in Fig. 1.

A. Structure of HFPNet

In terms of input, in addition to methods such as mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling, an algorithm is used to simulate ocean weather conditions for the input remote sensing data, adding factors such as lighting, rain and fog, and blur to the network training to improve the model's robustness for ocean remote sensing target detection. For the backbone, we used CBS (Conv/BN/SiLU) and C3 structures for feature extraction and performed multiscale feature fusion in the last layer. In the neck part, we designed the FEM and MFM modules to enhance the network's feature extraction ability for remote sensing data and its generalization ability for multiscale remote sensing targets. We also added a coordinate attention mechanism in front of the small-target detection head to focus on small targets.
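
To make the CBS building block concrete, the following is a minimal PyTorch sketch of a Conv/BN/SiLU unit; the default kernel size and stride are illustrative assumptions rather than the exact configuration used in HFPNet.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU activation (CBS).
    Kernel size and stride defaults are assumptions for illustration."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a 640x640 RGB input downsampled by a stride-2 CBS block.
y = CBS(3, 32, k=3, s=2)(torch.randn(1, 3, 640, 640))  # -> (1, 32, 320, 320)
```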

B. Structure of FEM

Current research in the field of computer vision has shown that adding channel attention and spatial attention can significantly improve a model. For example, SENet [29] automatically learns the importance of each feature channel and uses it to enhance useful features and suppress features that are unimportant to the current task. While the channel information of the object is important, the spatial structure cannot be ignored either. Therefore, the convolutional block attention module (CBAM) [30] combines the two-dimensional attention mechanisms of feature channel and feature space; it not only weights important features and suppresses unimportant ones along the channel dimension but also considers the spatial structure, improving network performance without significantly increasing computation and parameters. Although CBAM [30] tries to introduce position information by applying global pooling along the channel, this approach can only capture local information and cannot obtain long-range dependency [31], [32] information.

The FEM structure proposed in this article can enhance the feature extraction of the input data, weight the pixels of the object to be detected in the ocean, and, at the same time, weaken the weight of the background and nonobject pixels to improve the accuracy of object detection. The FEM structure is shown in Fig. 2.

Fig. 2. FEM structure diagram. It adopts channel segmentation to attend to the information of the channel branch and the spatial branch. The channel and spatial feature maps with added weight information are added to the input feature maps, and random feature maps are generated by shuffling the channels.

First, the feature information X \in R^{C \times H \times W} extracted by the backbone is passed through adaptive pooling layers to produce features of different scales; the purpose is to alleviate the image distortion caused by mismatched input sizes and to further enable the network to accurately recognize objects of different scales. Second, the output x_{i} \in R^{C \times H \times W} of the pooling layer is grouped [33] into two subfeatures x_{ij} \in R^{C/2 \times H \times W}, which generate channel and spatial attention feature maps through the activation function, respectively. The former produces attention coefficients describing the relationships between channels, and the latter produces spatial attention coefficients according to the relationships within the spatial feature layer; this lets the network learn which channels and which spatial locations deserve attention. Next, the two subfeature maps are added to obtain a weighted feature map y_{i} \in R^{C \times H \times W} containing spatial and channel features. All y_{i} feature maps are then concatenated with the input feature X, and channels whose response falls below a given threshold are removed, reducing computation while keeping the input and output channels consistent. Finally, the output features are shuffled [34] to improve the nonlinear ability of the network and reduce the probability of overfitting.
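
The following is a simplified PyTorch sketch of this FEM flow (channel split, channel and spatial weighting, fusion with the input, and channel shuffle). The specific attention layers and the use of a 1×1 convolution in place of the threshold-based channel removal are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels across groups (ShuffleNet-style shuffle).
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class FEMSketch(nn.Module):
    """Simplified sketch of the FEM branch logic: split channels, weight one
    half by channel attention and the other by spatial attention, merge with
    the input, and shuffle. The threshold-based channel removal described in
    the paper is replaced here by a 1x1 convolution (an assumption)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(half, half, 1), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.reduce = nn.Conv2d(channels + half, channels, 1)

    def forward(self, x):
        x_c, x_s = x.chunk(2, dim=1)                              # two C/2 sub-features
        y_c = x_c * self.channel_fc(x_c)                          # channel-weighted branch
        y_s = x_s * self.spatial_conv(x_s.mean(1, keepdim=True))  # spatially weighted branch
        y = y_c + y_s                                             # fused weighted map
        out = self.reduce(torch.cat([x, y], dim=1))               # restore C channels
        return channel_shuffle(out, groups=2)
```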

C. Structure of MFM

MFM mainly adopts multiple dilated convolutions in parallel to obtain receptive fields of different scales and aggregates features of each scale through global weight information. This method can enhance the robustness of the network to multiscale objects in the ocean, thereby improving the accuracy of the model in remote sensing object detection. The MFM structure is shown in Fig. 3.

Fig. 3. MFM structure. The CBR module refers to the concatenation of a convolutional layer, a batch normalization layer, and a ReLU activation layer.

The specific process is as follows: the features from the backbone and neck are spliced and then passed through four dilated convolutional layers of different scales and a global average pooling layer. The purpose of the multiscale atrous convolution [35] is to obtain receptive fields of different sizes from the input features and thereby improve the multiscale detection ability for marine remote sensing objects. The purpose of global average pooling is to approximate the input features as closely as possible while obtaining a global receptive field. Subsequently, the output weight vector is reduced in dimensionality by an attenuation factor to lower the computational load of the network and, after sigmoid activation, outputs 1×1×C weight information, similar to the channel attention process of SENet [29]. Finally, this module multiplies the output weight information with the feature maps of each scale, concatenates them in the channel dimension, adjusts the dimensions through the CBR module [20], and outputs the feature map after multiscale aggregation.
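
A hedged PyTorch sketch of this MFM flow is given below; the dilation rates, reduction ratio, and exact layer layout are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CBR(nn.Sequential):
    # Conv + BatchNorm + ReLU helper.
    def __init__(self, in_ch, out_ch, k=1):
        super().__init__(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class MFMSketch(nn.Module):
    """Sketch of the MFM idea: parallel dilated convolutions for multi-scale
    receptive fields, an SE-like global weight vector, weighted fusion, and a
    final CBR. Dilation rates and the reduction ratio are assumptions."""
    def __init__(self, channels, rates=(1, 3, 5, 7), reduction=16):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates])
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.fuse = CBR(channels * len(rates), channels)

    def forward(self, x):
        w = self.weight(x)                               # 1x1xC global weights
        feats = [conv(x) * w for conv in self.branches]  # weight each scale
        return self.fuse(torch.cat(feats, dim=1))        # aggregate and adjust dims
```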

D. HFPNet Loss Function

The loss of HFPNet is composed of three parts: 1) classification loss; 2) localization loss; and 3) confidence loss. The classification loss uses the focal loss function [9], the localization loss uses the EIoU loss [38], and BCE with logits loss is used as the confidence loss.

1) Classification Loss Function

Recently, the detection performance of the YOLOV5 network on various conventional datasets has reached an advanced level, but there is still room for improvement in the field of remote sensing. The YOLOV5 model treats classification as a set of binary classification problems and uses BCE with logits loss as the classification loss function, defined in formulas (1) and (2) \begin{align*} P=&\sigma (x)=\frac{1}{1+e^{-x}} \tag{1} \\ \text{loss}=&-\frac{1}{n} \sum _{x}[y \ln P+(1-y) \ln (1-P)] \tag{2} \end{align*}

where x is the sample and y is the true label, which takes the value 0 or 1, P is the probability output by the network through the sigmoid function, in the range (0, 1), and n is the number of samples.

Although BCE with logits loss converges quickly, it performs poorly at controlling the weights of positive and negative samples and at balancing the classification weights of easy samples (where the difference between the predicted value and the true value is small) and difficult samples (where the difference is large). In one-stage object detection, a target image may generate thousands of anchor boxes, but usually only a dozen or so anchor boxes (positive samples) match an object; the rest are designated as negative samples.

Although each negative sample produces only a small loss, the total loss they contribute can still disrupt the training weights of positive samples. In response to this problem, this article introduces the focal loss function [9], which dynamically balances the uneven proportion of positive and negative samples during training through a scaling factor \alpha \in [0,1]; the weight factor is determined by the proportion of the opposite class, that is, the more negative samples there are, the smaller the weight given to them, as defined in formula (4). The cross-entropy loss function with the ability to balance positive and negative samples is shown in \begin{align*} \text{CE}\left(p_{t}\right)=&-\log \left(p_{t}\right) \tag{3} \\ \alpha _{t}=&\begin{cases}\alpha, & \text{if } y=1 \\ 1-\alpha, & \text{otherwise} \end{cases} \tag{4} \\ \text{FL}\left(p_{t}\right)=&-\alpha _{t} \log \left(p_{t}\right) \tag{5} \end{align*}

where p_{t} is the predicted probability of the true class, and formula (3) is the cross-entropy loss.

In addition, simple negative samples make a major contribution to the network's loss and dominate the direction of the gradient update, making it impossible to accurately classify objects. Therefore, the focal loss function adds a factor \gamma \in [0,5], which reduces the loss of easy-to-classify samples and increases the loss of difficult and misclassified samples. The final focal loss function is shown in \begin{equation*} \text{FL}\left(p_{t}\right)=-\alpha _{t}\left(1-p_{t}\right)^\gamma \log \left(p_{t}\right). \tag{6} \end{equation*}

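A minimal PyTorch sketch of the focal loss in (6) follows; the values \alpha = 0.25 and \gamma = 2 are the common defaults from [9], not necessarily the settings tuned for HFPNet.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss following formula (6); alpha and gamma are assumed defaults."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # formula (4)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # formula (6), averaged over samples
```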

2) Localization Loss Function

The generalized intersection over union (GIoU) localization loss function [36] solves the problem of identical losses when the traditional intersection over union (IoU) equals 0 or when multiple anchor boxes have the same IoU [37] value, thereby accelerating the convergence of the loss, as shown in the second column of Fig. 4. However, when the real box contains the predicted box, GIoU degenerates into the original IoU, and the relative position problem remains unsolved. The DIoU loss [28] adds relative position information on the basis of GIoU and judges the loss according to the distance. However, as shown in the third column of Fig. 4, DIoU is inversely proportional to the diagonal distance c of the enclosing rectangle (orange box): when the distance between the center points of the two bounding boxes [8] is constant, the longer the diagonal of the rectangle, the smaller the DIoU loss. Therefore, inaccuracy may occur when detecting small-object samples.

Fig. 4. Schematic diagram of various localization loss functions.

Although CIoU [28] adds the aspect ratio consistency term V between the predicted anchor and the real anchor, defined in formula (7), when the aspect ratios of the predicted and real anchors are linearly proportional, the penalty term R_{\text{CIoU}} in CIoU no longer works. The CIoU penalty term is given in (8) \begin{align*} V&=\frac{4}{\pi ^{2}}\left(\arctan \frac{w^{g t}}{h^{g t}}-\arctan \frac{w^{p}}{h^{p}}\right)^{2} \tag{7} \\ R_{\text{CIoU}}&=\frac{d^{2}}{L^{2}}+\frac{V^{2}}{(1-\text{IoU})+V}. \tag{8} \end{align*}


Here, w^{g t} and h^{g t} are the width and height of the real anchor, and w^{p} and h^{p} are the width and height of the predicted anchor. d is the Euclidean distance between the center points of the predicted and real anchors, and L is the diagonal length of their minimum enclosing rectangle.

In order to further improve the accuracy of maritime remote sensing object detection, we adopt the EIoU loss [38], as shown in the fifth column of Fig. 4, which considers the overlapping area between the predicted and real anchors, the center point distance, and the width and height losses. The width and height losses directly minimize the differences between the width and height of the predicted anchor and those of the real anchor, leading to faster convergence and more accurate regression. Its loss function is shown in \begin{equation*} L_{\text{EIoU}}=1-\text{IoU}+\frac{d^{2}}{L^{2}}+\frac{\rho ^{2}\left(w, w^{g t}\right)}{C_{w}^{2}}+\frac{\rho ^{2}\left(h, h^{g t}\right)}{C_{h}^{2}}. \tag{9} \end{equation*}

where C_{w} and C_{h} are the width and height of the minimum enclosing rectangle of the predicted and real anchors, respectively.
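
A compact PyTorch sketch of the EIoU loss in (9) for boxes in (x1, y1, x2, y2) format is shown below; it follows the formula directly and is not the authors' implementation.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss per formula (9) for corner-format boxes; an illustrative sketch."""
    # IoU term.
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Minimum enclosing rectangle: diagonal L, width C_w, height C_h.
    enc_lt = torch.min(pred[..., :2], target[..., :2])
    enc_rb = torch.max(pred[..., 2:], target[..., 2:])
    cw = enc_rb[..., 0] - enc_lt[..., 0]
    ch = enc_rb[..., 1] - enc_lt[..., 1]
    diag2 = cw ** 2 + ch ** 2 + eps

    # Center distance d^2 and width/height differences.
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    ct = (target[..., :2] + target[..., 2:]) / 2
    d2 = ((cp - ct) ** 2).sum(-1)
    dw2 = ((pred[..., 2] - pred[..., 0]) - (target[..., 2] - target[..., 0])) ** 2
    dh2 = ((pred[..., 3] - pred[..., 1]) - (target[..., 3] - target[..., 1])) ** 2

    return (1 - iou + d2 / diag2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)).mean()
```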

SECTION III.

Experiments and Results

A. Experimental Environment and Data Preparation

This experiment is based on the torch deep learning framework and is carried out in the GPU environment during training and testing. The specific environment configuration is shown in Table I.

TABLE I Experimental Environment Configuration Table

In terms of datasets, this article used the publicly available remote sensing dataset NWPU VHR-10, which contains 800 remote sensing images with a total of 10 categories of objects, including ships, ports, bridges, etc. In addition, the MTDS dataset, which contains 800 remote sensing images, was constructed in this article. The MTDS dataset consists exclusively of marine objects such as ships, ports, bridges, dams, islands, and wind turbines, with respective image quantities of 240, 130, 80, 80, 130, and 140. The image data is sourced from Google Earth. About 40% of the remote sensing images from both datasets were randomly subjected to environmental enhancement, resulting in a dataset of 2400 marine remote sensing images, which were then divided into training, validation, and test sets with a ratio of 7:2:1.

To simulate the complex environmental conditions often encountered at sea, this article added blur, lighting intensity, and rain and fog factors to the dataset using algorithms, with the aim of making the model more robust. Specifically, OpenCV was used to perform environmental enhancement operations on the remote sensing images, controlling noise levels such as lighting with uniform random numbers and thresholds. Random noise, filters, and other methods were used to add rain and fog effects to the images, while Gaussian motion blur was used to simulate realistic motion blur effects.
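
A hedged OpenCV sketch of this kind of environmental enhancement is given below; the brightness range, fog strength, and blur kernel size are illustrative assumptions rather than the exact settings used to build the enhanced datasets.

```python
import cv2
import numpy as np

def augment_environment(img, rng=None):
    """Illustrative environmental enhancement: random lighting, additive fog,
    and motion blur. Parameter ranges are assumptions, not the paper's settings."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    # Lighting: scale brightness by a uniform random factor.
    out *= rng.uniform(0.5, 1.5)
    # Fog: blend with a bright gray layer at a random strength.
    fog_strength = rng.uniform(0.0, 0.4)
    fog = np.full_like(out, 220.0)
    out = cv2.addWeighted(out, 1.0 - fog_strength, fog, fog_strength, 0.0)
    # Motion blur: convolve with a horizontal averaging kernel.
    k = int(rng.integers(3, 9))
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0 / k
    out = cv2.filter2D(out, -1, kernel)
    return np.clip(out, 0, 255).astype(np.uint8)
```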

To demonstrate the multiscale nature of the MTDS dataset, this study compared it with Pascal VOC2007 and SSDD in terms of candidate box proportion allocation. As shown in Fig. 5, the ratio of target length to image size in MTDS is concentrated in the range of 0.02–0.48, which is much smaller than the range of 0.18–0.70 for PASCAL VOC [39] objects but larger than the range of 0.02–0.22 for SSDD. Therefore, MTDS captures the multiscale characteristics of remote sensing objects and is not limited to the detection of small objects such as ships, so the network gains the ability to accurately identify remote sensing objects of different scales.

Fig. 5. Proportion distribution of candidate anchors relative to image size in each dataset. There are 15 662 candidate anchors in the VOC2007 dataset, 2456 in SSDD, and 8400 in MTDS.

B. Evaluation Criterion

In this article, we use the precision rate P (precision), recall rate R (recall), AP (average precision), mAP (mean average precision), paras (network parameters), frames per second (FPS), and the computational cost GFLOPs (giga floating-point operations) as evaluation indicators. The formulas for each evaluation metric are as follows: \begin{align*} P_{d}(\text{ Precision})&=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{10} \\ R_{d}(\text{ Recall})&=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{11} \end{align*}

where d represents the IoU threshold used to determine whether a detection result is a true positive or a false positive. In addition, R_{d} represents the recall rate and P_{d} represents the precision rate. TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives \begin{align*} \text{AP}_{d}&=\int _{0}^{1} P_{d}\left(R_{d}\right) d R_{d} \tag{12} \\ \text{mAP}_{d}&=\frac{1}{N_{c}} \sum _{i}^{N_{c}} \left(\text{AP}_{i}\right)_{d} \tag{13} \\ \text{FPS}&=\frac{1}{\text{ Time }} \tag{14} \end{align*}
where N_{c} is the number of categories and \text{AP}_{i} is the AP of category i.

AP is a metric calculated based on the precision–recall curve, which is used to measure the performance of object detection algorithms. mAP represents the average of AP for m categories, which can evaluate the detection ability of the trained model for all categories. The calculation formula is shown in formula (13). In this article, \text{mAP}_{0.5} is calculated at an IoU threshold of 0.5, and \text{mAP}_{0.5:0.95} represents the average mAP calculated at different IoU thresholds (from 0.5 to 0.95 with a step of 0.05). The time (s) and FPS of one optical remote sensing image detection are defined by formula (14).
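
As a concrete illustration of (12), the following NumPy sketch computes AP as the area under the precision–recall curve using all-point interpolation; it is a generic implementation, not the exact evaluation code used in this article, and the per-class curves passed to it are hypothetical.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve, per formula (12)."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precision), [0.0]))
    # Make the precision envelope monotonically non-increasing.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP per formula (13): mean of per-class APs (per_class_curves is hypothetical).
# map_d = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_curves])
```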

C. Marine Environment Enhancement to Data

After adding complex weather noise to the ocean remote sensing images through algorithms, the data covers the various situations encountered in actual scenes, as shown in Fig. 6. To verify that environment enhancement can improve the accuracy of the model, we use MTDS and NWPU VHR-10 as datasets to test the base model YOLOV5s (whose backbone is CSPDarkNet53) and HFPNet; the experimental results on the test set are shown in Table II, where -A indicates that the dataset has undergone environment enhancement and -B indicates that it has not. Both YOLOV5(base) and HFPNet here are networks without the FEM and MFM. According to the results in Table II, environment enhancement increases the diversity of the training data, making the model more adaptable to real environmental changes and ultimately improving its generalization ability and robustness, thereby increasing the accuracy of object detection.

TABLE II Comparison of Data Results Before and After Environmental Augmentation of the Dataset
Fig. 6. Comparison of data for simulating the complex weather environment at sea. (a) Original image. (b) Darkened image. (c) Brightened image. (d) Fogged image. (e) Blurred image. (f) Rain image.

From the experimental data, it can be seen that the networks with added environmental noise have improved testing accuracy after training. YOLOV5s(base) obtained 93.26% and 91.26% \text{mAP}_{0.5} in MTDS-A and VHR-10-A, respectively, which is 0.61% and 0.42% higher than the accuracy obtained in MTDS-B and VHR-10-B.

However, the HFPNet in this article obtained 92.94% and 90.88% \text{mAP}_{0.5} on MTDS-A and VHR-10-A, respectively, an increase of 0.83% and 0.71% compared with MTDS-B and VHR-10-B. In the follow-up experiments, we use MTDS-A and VHR-10-A as the experimental datasets. The initial learning rate of HFPNet is set to 0.01, the momentum is set to 0.937, the input image size is 640×640, the batch size is set to 8, and training runs for 300 iterations; ablation experiments are compared against multiclass object detection models.

D. Effect of Adding FEM on Network

As the network deepens, after a series of convolution and pooling operations, the resolution of the feature map gradually decreases. Although the network can then extract rich deep features, it loses some shallow information; small objects in the marine environment in particular are easily missed or misdetected. In response to this problem, we embed the FEM module into different layers of the network to fully obtain multiscale context information, which alleviates, to a certain extent, the information loss caused by the reduction in the number of channels and by the convolution and pooling operations.

According to the data in Table III, different effects can be achieved by embedding the FEM in different feature extraction layers of the network. YOLOV5s(base) and HFPNet achieved their highest \text{mAP}_{0.5} and \text{mAP}_{0.5:0.95} when embedding the FEM in the 4th, 6th, and 8th layers of the network. More importantly, HFPNet with the FEM embedded in the 4th, 6th, and 8th layers achieved the highest \text{mAP}_{0.5} and \text{mAP}_{0.5:0.95} of all models, at 95.21% and 63.47%, respectively, an increase of 1.29% and 1.14% over the best-performing base model. It can be seen that the proposed FEM, and the weighting strategy within it, play an important role in feature extraction. The visualization results of different models on the test set are shown in Fig. 7. Generally speaking, the FEM enhances the feature extraction of positive samples and weakens the features of the background and other negative samples. It is worth noting that the number of FEMs is not proportional to network performance: most models perform better when the FEM is embedded in the 4th, 6th, and 8th layers of the backbone, but the YOLOV5s GSConv model achieved higher accuracy with the FEM added to the 4th and 6th layers than with it added to the 4th, 6th, and 8th layers. Therefore, it is necessary to conduct FEM-embedding ablation experiments on each network.

TABLE III Experimental Results of Ablation of FEM in Different Layers of Detection Networks on VHR-10-A
Fig. 7. Comparison chart of test results after integrating the FEM. Row (a) refers to YOLOV3, row (b) to YOLOV5-Mobile, row (c) to YOLOX, row (d) to PicoDet, row (e) to YOLOV5, and row (f) to HFPNet. The green arrows point to missed targets, while the blue arrows point to incorrectly detected targets.

E. Impact of FEM and MFM on the Network

The results in Table III demonstrate the accuracy contribution of the proposed FEM, but the network still suffers from weak convergence and low ability in detecting multiscale objects. The designed MFM not only accelerates the convergence of the network but also enhances its ability to detect multiscale objects. We conducted ablation experiments between our model and the YOLOV5s(base) on VHR-10-A and MTDS-A datasets, as shown in Tables IV and V. It can be observed that the base network with FEM and MFM in Table IV outperforms the base network without the two modules, and the accuracy on \text{mAP}_{0.5} and \text{mAP}_{0.5:0.95} is improved from 91.26% and 62.13% to 94.70% and 62.36%, respectively, with an increase of 3.44% and 0.23% in detection accuracy. Similarly, the HFPNet with the two modules also shows improvement in detection accuracy by 4.4% and 1.68% on these two evaluation metrics. In Table V, both the base model and HFPNet are affected by FEM and MFM, leading to an improvement of 2.44% and 8.47% in \text{mAP}_{0.5} and \text{mAP}_{0.5:0.95} for the base model with the two modules compared to the original one. The HFPNet also shows improvement by 4.3% and 11.71% on these two metrics. Note that the FEM module is embedded in the 4th, 6th, and 8th layers of the backbone.

TABLE IV Results of Ablation Experiments With the Addition of FEM and MFM to the Network on VHR-10-A Dataset
TABLE V Results of Ablation Experiments With the Addition of FEM and MFM to the Network on MTDS-A Dataset

For object detection networks, the number of model parameters is an important factor that cannot be ignored. As the network model becomes more complex, its parameter count and computational complexity increase accordingly. To reduce network parameters and improve detection speed, methods such as model compression and pruning have been proposed, but these often have adverse effects on small-object recognition. Therefore, this article proposes HFPNet, which combines the FEM and MFM, and compares it with other advanced object detection models on the MTDS-A and VHR-10-A datasets, as shown in Tables VI and VII. It can be seen that HFPNet achieves the best performance in terms of both detection accuracy and speed, demonstrating outstanding performance in marine remote sensing detection tasks. The loss and \text{mAP}_{0.5} accuracy curves of the detection networks trained on VHR-10-A are shown in Figs. 8 and 9; our model achieves the highest accuracy and the fastest convergence. Partial output visualization results are shown in Fig. 10.

TABLE VI Table of Experimental Results of Different Detection Models on VHR-10-A Test Set
TABLE VII Table of Experimental Results of Different Detection Models on MTDS-A Test Set
Fig. 8. Comparison of \text{mAP}_{0.5} indicators during the training process for different models. Panel (a) shows results obtained on the NWPU-10 dataset, while panel (b) shows results on the MTDS-A dataset.

Fig. 9. Comparison chart of the loss function during training and validation of different models.

Fig. 10. Comparison chart of detection results for different networks on the VHR-10-A and MTDS-A test sets. The images in columns (a), (b), (c), and (d) are from the VHR-10-A dataset, while the images in columns (e), (f), and (g) are from the MTDS-A dataset.

F. Influence of Attention Mechanism on Ocean Remote Sensing Detection

According to the above experimental comparison, HFPNet delivers the best performance in marine remote sensing object detection. At present, most experiments show that embedding an attention mechanism [1], [40], [41], [42], [43] can generally improve network performance, but it can also produce low-precision results by disrupting the network's feature weights. In order to explore the influence of the attention mechanism on HFPNet, we conducted ablation experiments, and the results are shown in Table VIII.

TABLE VIII Comparison Table of Experimental Results of Adding Different Attention Modules to HFPNet

We found that different attention modules have little impact on the HFPNet parameter count and inference time, and the additional computational load is also similar; for the GPU, it is almost negligible, but the impact on detection accuracy is significant. Embedding the SE and CBAM modules decreases model performance, while embedding the CA module improves the model's performance by 0.24% and 0.18% in \text{mAP}_{0.5} and by 0.14% and 1.30% in \text{mAP}_{0.5:0.95} on the two datasets, respectively. The performance comparison between HFPNet with the embedded attention mechanism and other models is shown in Fig. 11, which demonstrates that HFPNet achieves the best results in \text{mAP}_{0.5} and FPS on the test set. To demonstrate the model's performance improvement more clearly, we plotted heatmaps of the model's inference process, as shown in Fig. 12.

Fig. 11. Performance comparison of object detection models. The result in (a) was obtained on the VHR-10-A test set, while the result in (b) was obtained on the MTDS-A test set. The closer a mark is to the upper right corner, the better the performance.

Fig. 12. Comparison of heatmaps under different scenarios. The first five rows of images are from the VHR-10-A test set, and the last three rows are from the MTDS-A test set. Column (a) shows the input image after environmental enhancement. Column (b) displays the heatmap generated by the model without the FEM and MFM modules. Column (c) presents the heatmap obtained by the model after the dual effects of FEM and MFM. Column (d) shows the heatmap obtained by embedding the CA attention mechanism in HFPNet.

G. Strengthen Network Post-Processing

The classic NMS algorithm selects the bounding box [8] with the highest confidence among all anchors as the reference and removes all anchors whose IoU with the reference exceeds the NMS threshold [20]; IoU is the only factor considered. However, in practical applications, when two different objects are close together, their relatively large IoU means that only one bounding box remains after NMS processing, leading to missed detections. In response to this problem, we improve NMS by using DIoU as the suppression criterion. DIoU-NMS considers not only the IoU but also the distance between the center points of two anchors: if the IoU between two anchors is relatively large but the distance between their center points is also relatively large, the network treats them as anchors of two different objects, thereby reducing the probability of missing small maritime remote sensing objects. This article uses the DIoU-NMS method to test HFPNet on the MTDS-A validation set, and it yields a clear improvement on dense small objects; the comparison experiment is shown in Fig. 13. Finally, using DIoU-NMS improves HFPNet's mAP on the test set by 0.34%.
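
A minimal PyTorch sketch of DIoU-NMS as described is given below: a candidate is suppressed only when its IoU with the reference box, minus the normalized center-distance penalty, exceeds the threshold. The box format (x1, y1, x2, y2) and the default threshold are illustrative assumptions.

```python
import torch

def diou_nms(boxes, scores, iou_thr=0.5):
    """DIoU-NMS sketch for boxes of shape (N, 4) in (x1, y1, x2, y2) format."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU between the reference box i and the remaining boxes.
        lt = torch.max(boxes[i, :2], boxes[rest, :2])
        rb = torch.min(boxes[i, 2:], boxes[rest, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Normalized squared center distance (the DIoU penalty term).
        enc_lt = torch.min(boxes[i, :2], boxes[rest, :2])
        enc_rb = torch.max(boxes[i, 2:], boxes[rest, 2:])
        diag2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + 1e-7
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        d2 = ((ci - cr) ** 2).sum(dim=1)
        # Keep boxes whose DIoU with the reference stays below the threshold.
        order = rest[(iou - d2 / diag2) <= iou_thr]
    return torch.tensor(keep, dtype=torch.long)
```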

Fig. 13. HFPNet with the DIoU-NMS method. The first row of images shows the original images, the second row displays the detection results of HFPNet using the NMS method, and the third row shows the results obtained by HFPNet using the DIoU-NMS method. The green arrows indicate the locations of missed targets.

SECTION IV.

Conclusion

This article proposes a super feature pyramid network (HFPNet) to address missed and false detections in marine remote sensing object detection tasks. The FEM in the network generates spatial and channel attention coefficients; during training, the network enhances the weight information of objects based on these attention coefficients while reducing the influence of noise, significantly improving the detection accuracy of marine remote sensing objects. For the multiscale features of marine objects, the MFM in the network integrates feature information from different levels of the network, improving the detection of multiscale objects. In addition, to address the problem of imbalanced positive and negative samples in remote sensing datasets, we redesigned the network's loss function. Due to the lack of datasets containing only marine targets, we constructed the marine remote sensing dataset MTDS and enhanced both this dataset and the public dataset NWPU VHR-10 using our proposed method, resulting in improvements in \text{mAP}_{0.5} for both the base model and HFPNet on these two datasets. Finally, we compared all experimental models on multiple evaluation metrics; the results show that the complete HFPNet obtained the highest \text{mAP}_{0.5} and \text{mAP}_{0.5:0.95} on the MTDS-A and VHR-10-A datasets, with improvements of 3.98% and 11.53% on MTDS-A and 4.02% and 1.48% on VHR-10-A compared to the base model, and it also achieved the highest FPS on the test set.

We further explored the impact of attention mechanisms on the model and conducted ablation experiments on the attention mechanisms in the network, obtaining improvements of 0.24% and 0.18% in \text{mAP}_{0.5} on the two datasets, respectively. Finally, to further improve the model's prediction ability, we used the DIoU-NMS strategy in the network's postprocessing, resulting in a 0.34% increase in \text{mAP}_{0.5} on the test set. The experiments demonstrate that HFPNet can improve the detection performance of small-scale ocean remote sensing objects, but there is still a risk of missed detection when handling dense small objects. To further ensure the model's performance in ocean remote sensing, we will conduct follow-up research on rotated object detection for ocean remote sensing.
