
Improved YOLOv5s With Coordinate Attention for Small and Dense Object Detection From Optical Remote Sensing Images


Abstract:


The objects in optical high-resolution remote sensing images (HRRSIs) are usually tiny, dense, and set against complex backgrounds, which poses great challenges to accurate object detection. This article presents an improved YOLOv5s network for remote sensing object detection to overcome these difficulties. First, unnecessary residual modules are pruned from the cross-stage partial layer of the conventional YOLOv5s, and a refined residual coordinate attention module is incorporated to enhance the feature representation of the densely packed small objects in HRRSIs by introducing a residual structure and a mix pooling operation in place of the existing average pooling. Second, since objects of various scales are present in HRRSIs, the differential evolution algorithm is adopted in place of the traditional K-means to generate anchor boxes of diverse sizes. Third, we replace the complete intersection over union (CIoU) loss function commonly used in YOLOv5s with the AW-IoU loss function, which builds on both α-IoU and wise-IoU. AW-IoU expedites bounding box regression and focuses more on regular anchor boxes. Finally, instead of nonmaximum suppression (NMS), the SCYLLA-IoU (S-IoU) soft-NMS is employed to eliminate redundant duplicate boxes when detecting dense objects in remote sensing images. Experimental results on the NWPU VHR-10 dataset demonstrate that the proposed YOLOv5s method performs well compared with state-of-the-art algorithms.
Page(s): 2543 - 2556
Date of Publication: 18 December 2023



SECTION I.

Introduction

Recent years have witnessed rapid advances in remote sensing technologies across many applications, such as precision agriculture, traffic monitoring, and military reconnaissance. High-resolution remote sensing images (HRRSIs) are widely utilized in these fields because they provide highly detailed information, and object detection in HRRSIs has become one of the hot topics in these application areas. However, remote sensing objects are usually small and densely distributed, and the complex backgrounds caused by factors such as weather, illumination, and oceanic phenomena [1], [2] lead to missed or duplicate detections. Consequently, remote sensing object detection tends to suffer from unsatisfactory accuracy because densely packed small objects are not effectively represented in the feature domain.

There have been numerous solutions to tackle the aforementioned challenges in remote sensing object detection [3]. Conventional methods often employ machine learning techniques, yet their feature extraction typically entails manual tuning [4]. In contrast, deep-learning-based approaches have made great progress in accurate object detection and can be primarily divided into two categories: two-stage algorithms and one-stage algorithms [5]. Conventional two-stage algorithms include region convolutional neural networks (R-CNNs) [6], fast R-CNN [7], faster R-CNN [8], and mask R-CNN [9]. Although these algorithms achieve high object detection accuracy, they suffer from slow speed and lose the spatial information of local objects within the entire image. On the other hand, one-stage algorithms include the single-shot multibox detector (SSD) [10], [11], the you only look once (YOLO) methods [12], [13], [14], etc. These algorithms detect objects quickly, but their accuracy is not very high. YOLOv5 offers five model variants, among which the small model, YOLOv5s, achieves a good balance between detection speed and accuracy. In recent years, transformer-based methods [15], [16], [17] have also emerged as a novel approach to object detection.

Although the YOLOv5s method has achieved considerable success in remote sensing object detection, small and dense objects are prone to being missed or detected multiple times with inaccurate prediction boxes when it is applied to HRRSIs with complex backgrounds. For example, the storage tanks and small-sized aircraft in Fig. 1(a) are densely distributed in the remote sensing image. For the storage tanks and aircraft situated at the center of the image, the original YOLOv5s method produces many duplicate and inaccurate prediction boxes (marked by two red circles). Another example is provided in Fig. 1(b), where the car shadowed in the lower left corner of the image is missed (marked with a red circle) due to the low contrast in the image. Thus, feature extraction in the traditional YOLOv5s detection network can be challenging, leading to missed or duplicate detections when dealing with densely scattered tiny objects in complex environments. To tackle these issues, this article proposes an enhanced YOLOv5s algorithm. The main contributions of this study can be summarized as follows.

  1. We propose an improved YOLOv5s network by pruning the redundant residual blocks in the original cross-stage partial (CSP) layer [18] to reduce the parameters. Besides, to enhance the feature representation ability, a novel residual coordinate attention (RCA) module is designed by introducing the residual structure and mix pooling operation instead of the existing average pooling and incorporated into CSP without extra trainable parameters.

  2. A new anchor box generation algorithm is introduced based on the differential evolution (DE) algorithm instead of the original K-means technique to produce the various sizes of anchor boxes for improving HRRSI object detection accuracy.

  3. We propose a novel loss function named AW-IoU to substitute the complete intersection over union (CIoU) loss in YOLOv5s. AW-IoU not only focuses on most of the regular anchor boxes rather than extremely large or tiny anchor boxes but also accelerates the convergence speed of training the proposed method.

  4. SCYLLA-IoU (S-IoU) is integrated into soft nonmaximum suppression (NMS) to replace NMS in YOLOv5s so as to mitigate redundant detections of small and dense targets in HRRSIs.

Fig. 1. Challenges are encountered when detecting small, densely populated objects within HRRSIs that feature complex backgrounds. Specifically, (a) storage tanks and small-sized aircraft are subject to duplicate detection and inaccurately predicted bounding boxes and (b) objects, such as the car in the lower left corner, which are shadowed or otherwise obscured, often go undetected.

The rest of this article is organized as follows. Section II briefly introduces the related works. Section III details our improved YOLOv5s method. Section IV performs qualitative and quantitative comparisons with state-of-the-art methods. Finally, Section V concludes this article.

SECTION II.

Related Work

This section reviews the deep-learning-based two-stage and one-stage object detection methods, YOLO methods, and attention mechanism (AM) associated with HRRSI object detection.

A. HRRSIs Object Detection Via Deep Learning

Deep-learning-based object detection methods in HRRSIs can be roughly divided into two categories: two-stage and one-stage, according to their detection steps. In the two-stage approach, the first step involves region proposal, followed by the second step of classifying these proposals via convolutional neural networks (CNNs). Classic two-stage methods for HRRSI object detection encompass R-CNN [6], fast R-CNN [7], faster R-CNN [8], mask R-CNN [9], and so on. Despite achieving relatively high accuracy, their considerable computational cost renders them impractical for real-time HRRSI object detection.

In contrast, one-stage methods unify bounding box localization and regression in an end-to-end fashion. Representatives include SSD [10] and the YOLO methods [12], [13], [14]. Wang et al. [11] propose the feature-merged single-shot detection network, which fuses contextual information from multiscale features and includes a novel area loss function. This function, which decreases monotonically with object area, emphasizes smaller objects and improves prediction speed. Compared with two-stage methods, one-stage techniques excel in detection speed at the cost of a modest reduction in accuracy. As a result, they have attracted significant attention from researchers for HRRSI object detection.

B. HRRSIs Object Detection Based on YOLO Methods

Thanks to their one-stage structure, YOLO methods are efficient and are widely used to detect objects in remote sensing images. Xu and Wu [19] propose an enhanced YOLOv3 method employing a densely connected network (DenseNet) to augment the feature extraction ability. Similarly, Cao et al. [20] introduce an improved object detection method for HRRSIs based on YOLOv4 [21] by incorporating the pyramid pooling module [22] and substituting the original activation function with Mish [23]. Zhang et al. [24] design SuperYOLO by incorporating a multimodality fusion module that integrates RGB and infrared images, with an added super-resolution branch assisting in the detection of small objects in HRRSIs. Although a lightweight network can accelerate inference, the detection accuracy for densely distributed small objects remains unsatisfactory, especially for HRRSIs with complex backgrounds.

Owing to their lightweight and efficient design, YOLOv5 methods have garnered significant research attention. YOLOv5 offers five versions, i.e., YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in model size. Considering the limited number of remote sensing images, we empirically select the small version, YOLOv5s, to detect objects in HRRSIs. YOLOv5s primarily comprises three components: the backbone network, the neck component, and the detection head. The backbone network consists of the focus module, which conducts slicing operations, and the CSP module, which splits the feature maps of the base layers into two branches and merges them in a cross-stage hierarchy. The neck component makes use of spatial pyramid pooling fusion (SPPF), feature pyramid networks (FPN) [25], and a path aggregation network (PAN) [26] to enhance feature representations via fusion. In addition, the CIoU loss function is utilized in multiple prediction heads to detect small objects. However, the YOLOv5s method may cause missed or duplicate detections of small and dense objects in complex HRRSIs. To address these issues, AMs are considered to enhance the feature representation ability by paying more attention to important information.

C. HRRSIs Object Detection With AM

AMs attract much interest in the fields of computer vision and remote sensing image processing. Wang et al. [27] introduce the nonlocal neural networks to learn long-distance dependencies in two-dimensional (2-D) images. Pang et al. [28] propose a remote sensing R-CNN object detection method by incorporating a global AM to efficiently extract features while suppressing the false detections. Woo et al. [29] design a novel attention module producing attention maps along spatial and channel dimensions. Zhu et al. [30] introduce a transformer prediction head (TPH) YOLOv5 object detection framework, incorporating a TPH and a convolutional block attention module (CBAM) to mitigate abrupt object size changes and achieve more precise object localization. Lu et al. [31] present an attention and feature fusion SSD to boost the semantic information of the shallow features and introduce a dual-path attention module to highlight key features. In contrast, Li et al. [32] propose a coattention unit to extract the groupwise attention for object cosegmentation. However, these AMs do not simultaneously consider long- and short-range contexts. Hou et al. [33] design a coordinate attention (CA) to divide the channel attention into two 1-D encoding processes, capturing long-range information in one direction and positional information in the other. Liu et al. [34] propose a YOLO-extract method integrating CA into the backbone network using a dilated convolution [35] after a pruning operation.

Although the CA module can improve object detection performance, the adaptive average pooling operation in the CA module loses local feature information, decreasing detection accuracy for the densely distributed small objects in HRRSIs. To compensate for this loss of local feature information, we introduce a residual structure and mix pooling, in place of the existing average pooling operation, into the original CA module, aiming to capture diverse local information in a module we name RCA. Based on this, we present an improved YOLOv5s method incorporating the proposed RCA module and designing a better loss function for densely distributed small object detection in complex remote sensing images. The proposed method improves object detection accuracy without introducing additional parameters.

SECTION III.

Proposed YOLOv5s Network

In order to accurately detect the small and dense objects from remote sensing images, we propose an improved YOLOv5s network. This section first outlines the overall architecture of the proposed method. Subsequently, it presents the optimized backbone network with RCA-CSP, the DE-based anchor box generation, the AW-IoU loss function, and the S-IoU soft-NMS method in succession.

A. Overall Structure

As depicted in Fig. 2, the proposed YOLOv5s network primarily comprises three components: the backbone network, the neck component, and the prediction head. The backbone network incorporates our refined RCA module into the CSP layer, enhancing the feature representation capacity to improve object detection accuracy. In addition, redundant residual blocks in the CSP layer are pruned to reduce parameters and capture detailed features, facilitating the detection of densely distributed small objects in remote sensing images. For the neck component, the SPPF, FPN, and PAN modules are retained in YOLOv5s to ensure effective feature fusion. In the prediction head, we designed a novel AW-IoU loss function by integrating the α-IoU and wise-IoU loss functions, replacing the original CIoU loss in YOLOv5s.

Fig. 2. Architecture of the improved YOLOv5s network proposed in this article. The improvement in the backbone is marked by red.

B. Backbone Network With RCA-CSP

To improve the object detection accuracy in HRRSIs with complex backgrounds, AMs have been widely utilized to enhance the effective features and suppress the irrelevant ones. Several AM modules, such as squeeze and excitation (SE), CBAM, and pyramid split attention, have been proposed recently. However, these AMs usually consider only spatial or channel information while ignoring the significance of positional information. The CA module divides the channel attention into two 1-D encoding processes through coordinate information embedding and generation, thereby capturing positional information. Despite improving object detection performance, the adaptive average pooling operation in the CA module loses local feature information, reducing detection accuracy for the small and dense objects in HRRSIs. To enhance the feature representation capacity, we replace the reweight operation in the original CA module with a residual structure; the resulting residual coordinate attention module is named RCA and is depicted in Fig. 3. Furthermore, we substitute the existing average pooling operation in the RCA module with mix pooling to capture diverse local information.

Fig. 3. Diagram of the proposed RCA module. A residual structure is adopted to replace the reweight operation in the final step of the original CA and a mix pooling is employed to substitute the existing average pooling (marked by red) to improve the feature extraction ability.

As for the computation process in the RCA module, the input feature initially undergoes a mix pooling operation along the X and Y axes, yielding two feature maps associated with the horizontal and vertical orientations. These two maps are then concatenated along the spatial dimension and passed through a 1×1 convolution and a nonlinear activation function to form an integrated feature map. Subsequently, this integrated map is partitioned into two sections, each undergoing a 1×1 convolution to produce two attention feature maps corresponding to the two orientations. Finally, the two attention feature maps and the input features are multiplied elementwise, resulting in an attention feature map reflecting information from both orientations. This map is combined with the input feature through a residual connection, leading to an enhanced output feature that includes attention information.
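A minimal PyTorch sketch of the RCA computation described above is given below. Since the article does not specify the exact form of mix pooling, the residual step, or the hidden width, the sketch assumes that mix pooling is the elementwise mean of average and max pooling along one axis, that the residual step is an elementwise addition, and that the usual CA reduction ratio is used; the class name and parameters are illustrative.

```python
import torch
import torch.nn as nn

class RCA(nn.Module):
    """Sketch of the residual coordinate attention (RCA) module (assumptions noted above)."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv_squeeze = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    @staticmethod
    def _mix_pool(x: torch.Tensor, dim: int) -> torch.Tensor:
        # Mix pooling along one spatial axis: mean of average- and max-pooling (assumption).
        return 0.5 * (x.mean(dim=dim, keepdim=True) + x.amax(dim=dim, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        # Directional mix pooling: (B, C, H, 1) over the width, (B, C, W, 1) over the height.
        f_h = self._mix_pool(x, dim=3)
        f_w = self._mix_pool(x, dim=2).permute(0, 1, 3, 2)
        # Concatenate along the spatial dimension and squeeze with a 1x1 convolution.
        y = self.act(self.bn(self.conv_squeeze(torch.cat([f_h, f_w], dim=2))))
        # Split back into the two orientations and build one attention map per orientation.
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        # Elementwise attention followed by the residual connection.
        return x + x * a_h * a_w
```

Note that the 1×1 convolutions in this sketch do introduce a small number of weights; the article states that RCA is incorporated into CSP without extra trainable parameters, so the exact parameterization may differ from this sketch.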

The proposed RCA module allows the object detection network to concentrate on larger receptive fields without significantly increasing computational overhead compared with other attention modules, such as SE or CBAM. It captures vital information, including cross channel, directional, and positional information. Incorporating the RCA module into the backbone network enhances its ability to accurately detect and locate the small and dense objects from complex HRRSIs. Hence, we incorporate the proposed RCA module into the final step of each CSP layer in the backbone network, now termed RCA-CSP. The structure of the new RCA-CSP layer is illustrated in Fig. 4.

Fig. 4. Diagram of the improved RCA-CSP module. The proposed RCA module (marked by red) is integrated after the final ConModule to improve the feature extraction ability without extra trainable parameters.

Generally, the low-level features in CNN-based object detection methods have high resolution and rich location information but little semantic information. In contrast, high-level features have low resolution, providing inadequate location information but abundant semantic information. Given the huge number of small and dense objects in remote sensing images, the network layers cannot be excessively deep. In the YOLOv5s method, the CSP layer plays a critical role in enhancing the feature expression ability. It consists of a main branch and an auxiliary branch, each of which first performs a 1×1 convolution to reduce the number of channels. In the main branch, semantic features are extracted through residual blocks and then concatenated with the auxiliary branch. During this process, the CSP layer facilitates the extraction and fusion of distinct features between the main and auxiliary branches while decreasing the number of parameters and the computational burden of the network, consequently accelerating network training. The number of residual blocks in the CSP layers determines the complexity of the network. In the original YOLOv5s, the numbers of residual blocks in the four CSP layers are 3, 6, 9, and 3, respectively. Our experiments show that the number of residual blocks in the last two CSP layers has little effect on detection performance for HRRSIs. Therefore, the numbers of residual blocks in the CSP layers are reduced to 3, 6, 3, and 3, respectively; in other words, six residual blocks are pruned from the third CSP layer without an obvious decrease in object detection accuracy.
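To make the structure concrete, the following simplified PyTorch sketch shows a CSP layer with a configurable number of residual blocks, together with the per-stage depths before and after pruning. It is a schematic of the description above rather than the exact YOLOv5s implementation; the real C3 layer configures bottleneck counts, channel widths, and depth scaling through the model YAML.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    # Convolution followed by batch normalization and SiLU, as used throughout YOLOv5s.
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    # Residual block used inside the main branch of the CSP layer.
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(ConvBNAct(c, c, 1), ConvBNAct(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSP(nn.Module):
    # Simplified CSP layer: both branches start with a 1x1 convolution; the main branch
    # stacks `n` residual blocks before the two branches are concatenated and fused.
    def __init__(self, c_in, c_out, n):
        super().__init__()
        c_mid = c_out // 2
        self.main = nn.Sequential(ConvBNAct(c_in, c_mid, 1),
                                  *[Bottleneck(c_mid) for _ in range(n)])
        self.aux = ConvBNAct(c_in, c_mid, 1)
        self.fuse = ConvBNAct(2 * c_mid, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.main(x), self.aux(x)], dim=1))

# Residual-block counts per CSP stage: original YOLOv5s versus the pruned backbone here.
ORIGINAL_DEPTHS = (3, 6, 9, 3)
PRUNED_DEPTHS = (3, 6, 3, 3)  # six residual blocks removed from the third stage
```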

C. DE-Based Anchor Box Generation

The anchor boxes generated by K-means in the original YOLOv5s network achieve relatively high object detection accuracy on the COCO dataset but are not well suited to remote sensing images because of their fixed sizes, whereas the sizes of objects in HRRSIs vary greatly. In this article, the DE algorithm [36], a heuristic random search algorithm proposed by Storn and Price and inspired by genetic evolution, is used to generate optimized anchor box sizes. The DE algorithm exhibits strong robustness and requires minimal parameters, and it operates through four steps: initialization, mutation, crossover, and selection. Fig. 5 displays the flowchart of the DE algorithm (a code sketch of the procedure follows Fig. 5). The traditional loss, namely 1 - IoU, is employed as the fitness in our termination criterion: once the DE algorithm reaches the predetermined threshold, it terminates and outputs the optimized anchor box sizes.

Fig. 5. Flowchart of the DE algorithm. DE works to generate more anchor boxes in appropriate sizes.
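The following NumPy sketch illustrates DE-based anchor generation under the 1 - IoU fitness described above. The population size, mutation factor, crossover rate, and tolerance are illustrative assumptions, and the shape-only IoU compares boxes as if they shared a common center, as is common practice for anchor fitting.

```python
import numpy as np

def anchor_fitness(anchors, gt_wh):
    """Mean (1 - IoU) between each ground-truth box and its best-matching anchor.

    `anchors` has shape (k, 2) and `gt_wh` shape (n, 2); both hold (width, height) pairs.
    """
    inter = np.minimum(gt_wh[:, None, :], anchors[None, :, :]).prod(axis=2)
    union = gt_wh.prod(axis=1)[:, None] + anchors.prod(axis=1)[None, :] - inter
    best_iou = (inter / union).max(axis=1)
    return (1.0 - best_iou).mean()

def de_anchors(gt_wh, k=9, pop_size=30, iters=300, f=0.5, cr=0.9, tol=1e-4, seed=0):
    """Minimal differential-evolution sketch for anchor generation (assumed hyperparameters)."""
    rng = np.random.default_rng(seed)
    lo, hi = gt_wh.min(axis=0), gt_wh.max(axis=0)
    # Initialization: each individual is a set of k anchors sampled within the box-size range.
    pop = rng.uniform(lo, hi, size=(pop_size, k, 2))
    fit = np.array([anchor_fitness(ind, gt_wh) for ind in pop])
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = pop[rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)]
            mutant = np.clip(a + f * (b - c), lo, hi)         # mutation
            mask = rng.random((k, 2)) < cr                     # crossover
            trial = np.where(mask, mutant, pop[i])
            trial_fit = anchor_fitness(trial, gt_wh)
            if trial_fit < fit[i]:                             # selection
                pop[i], fit[i] = trial, trial_fit
        if fit.min() < tol:                                    # termination on the 1 - IoU criterion
            break
    return pop[fit.argmin()]
```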

Fig. 6 shows example object detection results on NWPU VHR-10 remote sensing images by YOLOv5s with anchor boxes generated by K-means and by the DE algorithm. The left image shows the result of the original YOLOv5s algorithm with K-means as the anchor box generation method, while the right one is obtained by the proposed YOLOv5s algorithm with the DE algorithm. It can be observed that the ground track field is missed by the original YOLOv5s algorithm, while the proposed algorithm with the DE method successfully detects this object. The main reason is that a large target cannot be detected well with the original fixed-size anchor boxes, whereas the anchor boxes generated by the DE algorithm vary widely in size and thus mitigate the problem of detecting objects whose sizes differ greatly in remote sensing images.

Fig. 6. Example of object detection results by YOLOv5s with anchor boxes generated by (a) K-means and (b) DE algorithms on NWPU VHR-10 dataset, respectively.

D. AW-IoU Loss

Bounding box regression (BBR) is a vital step in object detection for accurate object localization, and the success of our method heavily relies on selecting an appropriate loss function for BBR. YOLOv5s, widely recognized as one of the most advanced object detection techniques, leverages the CIoU loss function for BBR. As an improved version of the distance intersection-over-union [37] loss function, CIoU penalizes the discrepancy between the length-to-width ratios of the prediction box and the ground-truth box, as illustrated in the following equations:
\begin{align*} L_{\text{CIoU}} &= 1 - \text{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \left(\frac{v}{\left(1 - \text{IoU}\right) + v}\right) v \tag{1}\\ v &= \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{2} \end{align*}

where the variables $b$ and $b^{gt}$ are the center points of the predicted and ground-truth boxes, respectively; the meaning of the other symbols can be found in [37]. Since remote sensing images often contain challenging samples, CIoU fails to consider the balance between difficult and easy samples. Influenced by geometric factors, such as distance and aspect ratio, the penalty on hard samples can be exacerbated, reducing the generalization ability of the object detection method on other remote sensing images. A suitable loss function should minimize the impact of these geometric factors when the prediction box aligns closely with the ground-truth anchor box.

We propose to replace the original CIoU with the wise-IoU loss function [38], which effectively solves the aforementioned problem of overpenalizing hard samples. At the same time, we integrate the idea of α-IoU into wise-IoU (the result is named AW-IoU), which not only accelerates the convergence of the proposed YOLOv5s model but also provides stronger robustness for object detection in remote sensing images. Wise-IoU has three versions; the first, wise-IoU-v1, is computed as
\begin{align*} L_{\text{WIoUv1}} &= R_{\text{WIoU}} L_{\text{IoU}} \tag{3}\\ R_{\text{WIoU}} &= \exp\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^*}\right) \tag{4} \end{align*}

where $W_g$ and $H_g$ denote the width and height of the smallest enclosing box, $(x_{gt}, y_{gt})$ are the center coordinates of the ground-truth box, and $(x, y)$ are those of the prediction box.

As we know, the focal loss [39] implements a monotonic focusing mechanism for cross entropy, thereby decreasing the weight attributed to simple samples in the overall loss. This helps the model concentrate on more challenging samples and ultimately achieve better classification performance. Similarly, wise-IoU-v2 applies a monotonic focusing coefficient to wise-IoU-v1 and can be formulated as
\begin{equation*} L_{\text{WIoUv2}} = \left(\frac{L_{\text{IoU}}^{*}}{L_{\text{IoU}}}\right)^\gamma L_{\text{WIoUv1}}. \tag{5} \end{equation*}


Furthermore, wise-IoU-v3 uses a dynamic nonmonotonic focusing mechanism, which defines an outlier degree to describe the quality of an anchor box as follows:
\begin{equation*} \beta = \frac{L_{\text{IoU}}^{*}}{L_{\alpha\text{IoU}}} \in \left[0, +\infty\right). \tag{6} \end{equation*}


The YOLOv5s model achieves higher object detection accuracy as the outlier degree of the anchor boxes decreases. To steer BBR toward regular-quality anchor boxes, a small gradient gain should be assigned to anchor boxes with either a low or a high outlier degree, which effectively prevents large gradients from hard samples. Wise-IoU-v3 utilizes a nonmonotonic focusing coefficient derived from $\beta$ to modify the wise-IoU-v1 loss as follows:
\begin{equation*} L_{\text{WIoUv3}} = r L_{\text{WIoUv1}}, \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}. \tag{7} \end{equation*}


To accelerate the convergence of the object detection model, we introduce the idea of α-IoU [40] to improve the performance of wise-IoU-v3. On this basis, we design the AW-IoU loss function, which makes the object detection algorithm focus more on regular-quality anchor boxes; specifically, a small gradient gain is allocated to anchor boxes of either higher or lower quality. The AW-IoU loss proposed in this article can be formulated as follows:
\begin{equation*} L_{\text{AWIoU}} = \exp\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^*}\right) \gamma \left(1 - \text{IoU}^t\right) \tag{8} \end{equation*}

where $t$ plays the same role as $\alpha$ in α-IoU (we denote it by $t$ to distinguish it from the $\alpha$ inside $\gamma$) and is set to 3, and $\gamma$ is the nonmonotonic focusing coefficient $r$ defined in (7), with its parameter $\alpha$ set to 1.9 and $\delta$ set to 3. The computational graph detaches $W_g$ and $H_g$ (the detach operation is indicated by the superscript *).
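The following PyTorch sketch shows one way (8) can be computed for a batch of axis-aligned boxes. The helper names are ours, and the outlier degree β is estimated with a caller-supplied running mean of the detached α-IoU loss, since the article does not spell out that bookkeeping; treat this as an assumption rather than the exact implementation.

```python
import torch

def bbox_iou_xyxy(pred, target, eps=1e-7):
    """IoU plus the terms needed for AW-IoU; boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    x1 = torch.maximum(pred[:, 0], target[:, 0])
    y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2])
    y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Smallest enclosing box (width Wg, height Hg) and the two box centers.
    wg = torch.maximum(pred[:, 2], target[:, 2]) - torch.minimum(pred[:, 0], target[:, 0])
    hg = torch.maximum(pred[:, 3], target[:, 3]) - torch.minimum(pred[:, 1], target[:, 1])
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    return iou, wg, hg, cxp, cyp, cxt, cyt

def aw_iou_loss(pred, target, loss_mean, t=3.0, alpha=1.9, delta=3.0, eps=1e-7):
    """Sketch of the AW-IoU loss in (8); `loss_mean` is an assumed running mean of the
    detached alpha-IoU loss, used to estimate the outlier degree beta."""
    iou, wg, hg, cxp, cyp, cxt, cyt = bbox_iou_xyxy(pred, target, eps)
    # Distance-based attention R_WIoU; the enclosing-box term is detached (the * in (4)).
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) / (wg ** 2 + hg ** 2 + eps).detach())
    l_alpha_iou = 1.0 - iou ** t                        # alpha-IoU style term with exponent t
    beta = l_alpha_iou.detach() / (loss_mean + eps)     # outlier degree, cf. (6)
    gamma = beta / (delta * alpha ** (beta - delta))    # nonmonotonic focusing coefficient, cf. (7)
    return (r_wiou * gamma * l_alpha_iou).mean()
```

In practice, `loss_mean` would be maintained as an exponential moving average of the detached loss over training batches, so that the gradient gain for very easy and very hard boxes stays small.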

The object detection results by the original YOLOv5s method with CIoU as a loss function are shown in Fig. 7(a), while the proposed method with AW-IoU as loss function is shown in Fig. 7(b). By comparison, the duplicate detection boxes of small and dense aircraft by the original YOLOv5s method are effectively avoided by the improved YOLOv5s method with AW-IoU as the loss function.

Fig. 7. Instance of object detection results by YOLOv5s with (a) CIoU and (b) proposed AW-IoU as loss functions on NWPU VHR-10 dataset, respectively.

E. S-IoU-Based Soft-NMS

NMS is a frequently utilized postprocessing step in object detection algorithms to remove redundant prediction boxes. However, if the NMS threshold is small, the bridge inside the green box in Fig. 8 is prone to being suppressed; if the threshold is large, false positive detections easily arise because the suppression is not strict enough. As an improved version of NMS, soft-NMS [41] overcomes this problem by reducing the scores of overlapping boxes rather than discarding them outright. Specifically, the steps of the soft-NMS algorithm are as follows.

Fig. 8. Example of the limitations of the original NMS. The bridge inside the green box is prone to suppression for a smaller threshold in NMS.

Step 1. All the detected objects are sorted from the highest score to the lowest.

Step 2. The object with the highest score is selected, and its overlap with each remaining object is calculated using IoU; the weights of the remaining objects are adjusted accordingly.

Step 3. The confidence of each remaining bounding box is reduced by weighting its original score with the adjusted weight.

Step 4. Steps 2 and 3 are repeated until all objects have been processed.

Because targets in remote sensing images are small and dense, soft-NMS with the plain IoU may still leave duplicate detections. To overcome this problem, we combine the S-IoU [42] with soft-NMS, replacing the original IoU, to suppress duplicate prediction boxes. Compared with the traditional IoU, the advantage of S-IoU is that it simultaneously considers the distance, shape, and angle of the prediction boxes. Consequently, S-IoU represents the spatial relationship between prediction boxes more accurately and improves the detection accuracy for small and dense objects in HRRSIs.
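A compact sketch of Gaussian soft-NMS following the four steps above is given below; substituting an S-IoU implementation for the plain IoU helper would yield the S-IoU soft-NMS used in this article. The sigma and score-threshold values are illustrative assumptions, and the angle-, distance-, and shape-cost terms of S-IoU are omitted for brevity.

```python
import numpy as np

def iou_one_vs_many(box, boxes, eps=1e-7):
    """Plain IoU of one box (x1, y1, x2, y2) against many; an S-IoU variant would further
    discount this value by angle, distance, and shape costs."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + eps)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001, overlap_fn=iou_one_vs_many):
    """Gaussian soft-NMS sketch; pass an S-IoU overlap function to obtain S-IoU soft-NMS."""
    boxes, scores = boxes.astype(float), scores.astype(float)
    keep, idxs = [], np.arange(len(scores))
    while len(idxs) > 0:
        # Steps 1-2: pick the highest-scoring remaining box and measure overlap with the rest.
        best = idxs[np.argmax(scores[idxs])]
        keep.append(best)
        idxs = idxs[idxs != best]
        if len(idxs) == 0:
            break
        ov = overlap_fn(boxes[best], boxes[idxs])
        # Step 3: decay, rather than discard, the confidence of overlapping boxes.
        scores[idxs] *= np.exp(-(ov ** 2) / sigma)
        # Step 4: drop boxes whose decayed confidence falls below the threshold, then repeat.
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```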

SECTION IV.

Experimental Results and Analyses

In this section, we implement a series of experiments to evaluate the performance of our proposed method. The experimental results on remote sensing images are compared with the state-of-the-art methods, both qualitatively and quantitatively, to verify the effectiveness and efficiency of our approach in detecting densely packed small objects. In addition, we conducted several ablation studies to verify the effect of each component within the proposed method.

A. Experimental Settings

We comprehensively compare with current object detection methods that are as similar to our proposed YOLOv5s method as possible in model size, including swin-transformer-tiny (ResNet-50) [43], faster R-CNN (ResNet-50) [8], SSD (ResNet-50) [10], YOLOv3 (DarkNet-53) [14], YOLOv5 (DarkNet-53) [30], and YOLOv7 (DarkNet-53) [44]. The methods are characterized by their architectures, with the first being a transformer-based approach and the remaining five being CNN based. Among these, faster R-CNN [8] is a two-stage CNN model, whereas the others are all single-stage object detection methods based on CNNs. Note that the parenthetical content indicates the backbone of each network. For fair comparisons, all methods are fine-tuned using the NWPU VHR-10 [1] dataset, which comprises 800 remote sensing images across 10 classes, each with a resolution of 640×640 pixels. The images are split into training, validation, and testing subsets at a ratio of 6:2:2.

Our proposed model was trained using stochastic gradient descent on a platform equipped with an AMD Ryzen 7 5800H CPU and an NVIDIA GeForce RTX 3070 GPU (8 GB). The initial learning rate is 0.0025, the weight decay is set to 0.0005, and the momentum is 0.937. The batch size and the number of epochs are set to 16 and 200, respectively. In addition, an early stopping criterion with a patience of 20 is adopted to mitigate overfitting: the mean average precision at 50% intersection over union (mAP50) of our method is evaluated on the validation set after every two training cycles, and if mAP50 does not improve over 20 consecutive epochs, training is stopped. This strategy is intended to enhance the generalization ability of our method on unseen data.
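For reference, the reported settings can be collected into a configuration sketch such as the following; the key names mirror common YOLOv5-style hyperparameter files and are our own choice, not taken from the article.

```python
# Hypothetical training configuration mirroring the settings reported above.
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "lr0": 0.0025,           # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "batch_size": 16,
    "epochs": 200,
    "patience": 20,          # early stopping on validation mAP50
    "img_size": 640,
}
```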

B. Evaluating Metric

The mAP metric is widely utilized for evaluating object detection methods, and its computation depends on intersection over union (IoU) values. Specifically, for IoU thresholds of 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95, the mAP is denoted as mAP50, mAP55, mAP60, mAP65, mAP70, mAP75, mAP80, mAP85, mAP90, and mAP95, respectively. The mAP is computed from precision and recall, as described below.

Precision measures the prediction accuracy as the fraction of true positives among all predicted targets:
\begin{equation*} P = \frac{\text{TP}}{\text{TP} + \text{FP}}. \tag{9} \end{equation*}


Recall measures the fraction of ground-truth objects that are correctly detected:
\begin{equation*} R = \frac{\text{TP}}{\text{TP} + \text{FN}}. \tag{10} \end{equation*}


The two equations above involve TP (true positives), FP (false positives), and FN (false negatives). Since precision and recall are often in conflict, the mAP indicator is introduced to evaluate the performance of an object detection method using both quantities simultaneously. AP and mAP are calculated by (11) and (12), respectively,
\begin{align*} \text{AP} &= \int_{0}^{1} P\left(r\right) dr \tag{11}\\ \text{mAP} &= \frac{1}{N}\sum_{i = 1}^{N} \text{AP}_i \tag{12} \end{align*}

where $P(r)$ is the precision at recall $r$ and $N$ is the number of object categories. In addition, we also use the average of mAP over IoU thresholds from 0.5 to 0.95 in the evaluations, denoted by mAP@[0.5, 0.95].
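A small NumPy sketch of (11) and (12) is shown below; it uses the standard all-points interpolation of the precision-recall curve, which is an assumption since the article does not state which AP variant it adopts.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under a (sorted, ascending-recall) precision-recall curve, cf. (11),
    using the all-points interpolation."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]           # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

def mean_ap(per_class_ap):
    """mAP as the mean of per-class APs, cf. (12); averaging the mAPs obtained at IoU
    thresholds 0.5, 0.55, ..., 0.95 yields mAP@[0.5, 0.95]."""
    return float(np.mean(per_class_ap))
```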

C. Qualitative Evaluation

1) Effectiveness of the Proposed YOLOv5s Method

The object detection results on remote sensing images, obtained with a confidence threshold of 0.65, are used to evaluate the effectiveness of the proposed method. Fig. 9 shows the object detection results of (a) the proposed YOLOv5s method and (b) the original YOLOv5s method for densely packed small objects. The image in the first row contains many dense storage tanks, and these tanks are accurately identified and located by our method with no duplicate detections. In contrast, the original YOLOv5s method produces numerous duplicate and inaccurate prediction boxes. The image in the second row contains densely distributed small aircraft. It is evident that the original YOLOv5s method yields numerous inaccurate prediction boxes and redundant detections, e.g., for the aircraft at the bottom left of the image. In contrast, the object detection results of our method are more accurate, without duplicate bounding boxes, successfully locating and identifying the aircraft. This superior performance primarily stems from the introduced S-IoU soft-NMS, which effectively suppresses duplicate bounding boxes, and from the DE algorithm, which generates anchor boxes of diverse sizes and thereby facilitates the detection of objects with significant size differences in remote sensing images.

Fig. 9. Effectiveness of the proposed YOLOv5s method to detect small and dense objects. (a) and (b) Object detection results by the proposed YOLOv5s method and the original YOLOv5s, respectively.

Fig. 10 validates the effectiveness of the proposed YOLOv5s method on more complex remote sensing images, for which the original YOLOv5s model tends to erroneously assign small objects to other categories. As shown in the right image of the first row, the overpass is erroneously classified as a bridge since its structure is very similar to that of a bridge; such a problem does not occur with the proposed model. Similarly, the original YOLOv5s model mistakes the watermark at the bottom of the second row for a ship, which highlights the challenge of detecting targets with blurred visual appearance. In contrast, the watermark is not misclassified by the proposed YOLOv5s method, because the CIoU commonly used in the original YOLOv5s allocates nearly uniform weights to all bounding boxes, whereas the AW-IoU proposed in this article pays more attention to the regular bounding boxes. All the above results demonstrate the effectiveness of the proposed YOLOv5s method in detecting small and dense remote sensing objects.

Fig. 10. Comparison of object detection results on NWPU VHR-10 remote sensing images with complex backgrounds between (a) the proposed YOLOv5s method and (b) the original YOLOv5s network.

2) Comparison With State-of-the-Art Methods

To further verify the efficacy of the proposed YOLOv5s method, more experiments are conducted on the NWPU VHR-10 remote sensing dataset, and the object detection results are compared against cutting-edge methods, as depicted in Fig. 11. In this figure, the first seven rows are the object detection results obtained by faster R-CNN [8], SSD [10], YOLOv3 [14], the original YOLOv5s [30], swin-transformer-tiny [43], YOLOv7 [44], and the proposed YOLOv5s, respectively, and the last row is the ground truth. As observed in the first column, the bare land in the lower right corner is erroneously recognized as a baseball field by faster R-CNN, SSD, YOLOv5s, and swin-transformer-tiny; this false detection does not occur with YOLOv3 and YOLOv7, possibly due to their weaker feature extraction. The improved YOLOv5s algorithm avoids these false detections. Similarly, faster R-CNN and SSD produce false positives by mistaking the lawn for a baseball diamond in the second column, and YOLOv3 fails to detect some airplanes. In contrast, there are no missed or false detections by YOLOv5s, swin-transformer-tiny, YOLOv7, or our improved YOLOv5s method, which demonstrates that the proposed YOLOv5s obtains better object detection accuracy and performance. The advantage of our improved method is also obvious in remote sensing images with complex backgrounds. For example, there is a large shadow in the central area of the image in the third column. Although faster R-CNN and swin-transformer-tiny detect most cars, they have duplicate detection issues. Compared with YOLOv7, our method misses only one car partially located in the shadow at the left center of the image, while the other algorithms have more missed detections. Similarly, the image in the last column presents a more challenging scenario containing a gray car in shadow that is difficult to discern even with the naked eye. SSD, YOLOv3, the original YOLOv5s, swin-transformer-tiny, and YOLOv7 fail to recognize this car, but the two-stage faster R-CNN and the proposed YOLOv5s algorithm successfully detect and accurately locate it, since the proposed RCA module improves the feature extraction ability in low-contrast shadow areas. Regarding detection speed, the enhanced one-stage algorithm outperforms the two-stage method, signifying the superiority of our proposed method in handling densely packed small objects in remote sensing images with complex backgrounds.

Fig. 11. Comparison on the detection performance of small and dense objects in NWPU VHR-10 remote sensing images with complex backgrounds between the proposed YOLOv5s method and state-of-the-art approaches of faster R-CNN (two-stage), and SSD, YOLOv3, original YOLOv5s, swin-transformer-tiny, and YOLOv7 (one-stage), respectively. The last row is ground truth.

Remote sensing images of sea scenarios with cluttered backgrounds also contain objects of interest that pose additional challenges for object detection. Fig. 12 provides object detection results on such images and compares them with state-of-the-art methods. In the first column, there are nine ships in the image that are difficult for SSD and YOLOv3 to discern accurately, and false detections occur for the original YOLOv5s, swin-transformer-tiny, and YOLOv7. In contrast, the two-stage faster R-CNN and the proposed YOLOv5s algorithm successfully detect and accurately locate all of these ships. Similarly, the two ships in the upper left and bottom left of the second column are difficult to detect: the former is very small and far from the image center, and the latter is partially obscured by clouds, making it even harder to discern. From the detection results, it can be observed that SSD, YOLOv3, and the original YOLOv5s all fail to recognize both ships, while faster R-CNN, swin-transformer-tiny, YOLOv7, and the proposed algorithm successfully detect them, although swin-transformer-tiny produces many duplicate detections. In the third column, the petroleum pipelines are mistakenly detected as bridges by faster R-CNN, SSD, YOLOv3, the original YOLOv5s, and swin-transformer-tiny, leading to false positives, and many dense small storage tanks and boats are missed. In contrast, our proposed YOLOv5s method produces no false bridge detections, which demonstrates its better object detection accuracy and performance. As a special case, there are no objects of interest (ships) in the wetland image of the last column, which resembles a sea scenario owing to its deep green color. Some bare stones in this image are erroneously detected as ships by faster R-CNN, SSD, YOLOv5s, swin-transformer-tiny, and YOLOv7; this false detection does not occur with YOLOv3, possibly due to its weaker feature extraction. The improved YOLOv5s algorithm avoids these false detections.

Fig. 12. Comparison on the detection performance of small and dense objects in NWPU VHR-10 remote sensing images with sea scenarios between the proposed YOLOv5s method and state-of-the-art approaches of faster R-CNN (two-stage), and SSD, YOLOv3, original YOLOv5s, swin-transformer-tiny, and YOLOv7 (one-stage), respectively. The last row is ground truth.

D. Quantitative Evaluation

To quantitatively validate the effectiveness of the proposed YOLOv5s method, we first compare the object detection results between our method and the original YOLOv5s method in terms of AP values, and then between our method and state-of-the-art object detection methods in terms of mAP, mAP50, and mAP75, respectively.

We calculate the AP values for each category in the NWPU VHR-10 dataset on the basis of the ground truth, and the results are shown in Table I. From this table, it can be observed that the AP values of all categories except vehicle increase compared with the original YOLOv5s method. Since airplanes, ships, and storage tanks are small and dense, the proposed YOLOv5s algorithm obtains a significant increase in AP for them, as our method focuses more on densely distributed small objects. For large and medium objects, e.g., the baseball diamond and tennis court, the AP values also increase compared with the original YOLOv5s method, showing that our method achieves better object detection results in most cases. A slight decrease is observed for vehicles, possibly because they are easily affected by complex lighting conditions and severe occlusion, which remains a challenging problem in remote sensing object detection.

TABLE I Comparison Between the Proposed YOLOv5s Method and the Original YOLOv5s in Terms of AP for Object Detection of Each Category on the NWPU VHR-10 Remote Sensing Dataset

To further quantitatively validate the effectiveness of the proposed method, we calculate mAP, mAP50, and mAP75 for the object detection results on the NWPU VHR-10 dataset by state-of-the-art methods on the same platform, and the results are shown in Table II. It can be observed that the mAP and mAP75 values of the proposed method are consistently higher than those of the mainstream two-stage algorithm, faster R-CNN, and the one-stage algorithms SSD, YOLOv3, swin-transformer-tiny, YOLOv5s, and YOLOv7. In addition, the mAP50 values of all methods are above 90%, and our method ranks second, only 0.2% below the highest value achieved by YOLOv7, demonstrating its excellent object detection performance. It can also be observed that the gain in detection accuracy obtained by our proposed method over the original YOLOv5s is marginal when the IoU threshold is set to 50%. The main reason is that the localization errors tolerated under this threshold are larger, so the accuracy does not improve noticeably even though the YOLOv5s model is improved. However, our model exhibits a significant advantage at the IoU threshold of 75%, where more precise object localization is required and where our model has been optimized. These comparisons indicate that our model has distinct advantages in scenarios that require more precise localization and validate the effectiveness of our method in improving object localization accuracy.

TABLE II Comparisons Between the Proposed YOLOv5s Method and State-of-the-Art Object Detection Methods on the NWPU VHR-10 Remote Sensing Dataset in Terms of mAP, mAP50, and mAP75, Respectively

E. GFLOPs and Computational Time

All the compared methods are implemented on the same platform, as stated in Section IV-A. Table III reports the GFLOPs (B), computational time (ms), and number of parameters (M) for each model on the NWPU VHR-10 remote sensing dataset with an image size of 640×640 pixels. As demonstrated, our method incurs lower GFLOPs than swin-transformer-tiny, faster R-CNN, SSD, and YOLOv3, and its GFLOPs are lower than or comparable to those of YOLOv5s and YOLOv7. Notably, the proposed approach achieves a gain of about 11% in mAP over the second-best method (YOLOv7). As for computational time, our model runs fastest among the compared methods because redundant residual modules are pruned from the RCA-CSP layers and the improved RCA module does not introduce extra parameters. Finally, our model has 7.07M training parameters, which is much smaller than faster R-CNN, SSD, YOLOv3, and swin-transformer-tiny, and comparable with the original YOLOv5s and YOLOv7 methods.

TABLE III Comparisons Between the Proposed YOLOv5s Method and State-of-the-Art Object Detection Methods on the NWPU VHR-10 Remote Sensing Dataset in Terms of GFLOPs, Computational Time (ms), and Parameters (M), Respectively
TABLE IV Model Ablation Analysis on the NWPU VHR-10 Remote Sensing Dataset in Terms of mAP

F. Ablation Study

To verify the effect of each component in the proposed YOLOv5s method, Table IV presents the ablation results on the NWPU VHR-10 dataset. For convenience, each intermediate model is given an abbreviated name in the first column. The original YOLOv5s method achieves 62.6 mAP on the test set. On this basis, the mAP of the YOLOv5s-V1 method increases by 1.1 when DE replaces the original K-means algorithm, which enhances the detection of objects of various sizes. In addition, the new AW-IoU loss function is adopted in YOLOv5s-V2 to replace the CIoU commonly used in the original YOLOv5s, which not only makes the network focus more on regular anchor boxes but also accelerates convergence; YOLOv5s-V2 improves by 0.3 mAP over YOLOv5s-V1. Furthermore, YOLOv5s-V3 adds the enhanced backbone network, which boosts the feature extraction ability and yields a further improvement of 0.2 mAP over YOLOv5s-V2. Our final model, YOLOv5s-V4, utilizes S-IoU soft-NMS instead of NMS on the basis of YOLOv5s-V3 to remove redundant anchor boxes of dense targets, resulting in a substantial improvement of 0.8 mAP. In total, the proposed YOLOv5s-V4 outperforms the original YOLOv5s method by 2.5 mAP on the NWPU VHR-10 dataset, demonstrating the effectiveness of the proposed method.

SECTION V.

Conclusion

In this article, we proposed a deep neural network to detect the small objects that are densely packed in remote sensing images with complex backgrounds. We optimized the backbone network by pruning the redundant residual blocks in the CSP layer from the original YOLOv5s network and integrating our proposed RCA-CSP module to enhance the feature representation capacity. The DE-based anchor box generation algorithm was adopted to replace the original K-means algorithm to produce various-sized anchor boxes. In addition, we designed a new loss function of AW-IoU to focus on most of the regular anchor boxes and accelerate the convergence speed of our method. Finally, the proposed method employs S-IoU soft-NMS instead of NMS to reduce the occurrence of duplicate object detections for small and densely packed objects in remote sensing images. The experimental results demonstrate that our method is effective for accurately detecting small objects within complex backgrounds, resulting in an improved detection accuracy compared with state-of-the-art algorithms. In future work, the data augmentation method could be optimized by incorporating prior knowledge of complex backgrounds to further enhance feature extraction capabilities in even more intricate environments.
