An Efficient Method for Detecting Dense and Small Objects in UAV Images

Object detection in unmanned aerial vehicle (UAV) images is an important and challenging task for many applications, which often needs highly efficient detection algorithms to meet the accuracy and real-time requirements of the applications. In this article, we investigate efficient mechanisms for detecting dense and small objects in UAV images. Specifically, 1) kernel K-means is used to obtain optimal anchors for dense and small object detection; 2) a spatial information enhancement module is proposed to improve the detection accuracy of dense objects by extracting object spatial location information; 3) a Coord_C3 module is proposed to improve the receptive field of the network and to reduce the number of network parameters; and 4) a small detection head is added in the Head of the network and skip connections are employed in the Neck of the network to improve the detection accuracy of small objects. Experimental results on the VisDrone-2019, LEVIR-ship, and Stanford Drone datasets show that our method not only has higher detection accuracy but also runs faster compared to state-of-the-art detection methods.

Chenyang Li, Student Member, IEEE, Suiping Zhou, Hang Yu, Member, IEEE, Tianxiang Guo, Yuru Guo, and Jichen Gao

I. INTRODUCTION
With the development of unmanned aerial vehicle (UAV) technology, object detection in UAV images has found a wide range of applications, including urban environment monitoring [1], land utilization planning [2], forest fire monitoring [3], traffic management [4], and military uses [5]. Images captured under a UAV's field of view are more complex than images of natural scenes. Specifically, 1) UAV images are variable and complex, and object distribution may be dense or sparse; existing object detection methods often demonstrate low robustness in this context; and 2) as UAVs are usually far from the ground, the objects in the captured images may be small, which makes it hard to extract their real contours with existing object detection methods. Thus, detecting dense and small objects in UAV images is an important and challenging task.
Many methods have been proposed for detecting small objects in UAV images [6], [7], [8], [9], [10], [11], [12], [13], [14]. The authors in [8] enhanced the accuracy of small object detection through alignment and fusion of shallow spatial features and deep semantic features, employing candidate region feature alignment. Liu et al. [11] improved the detection accuracy of small objects in UAV images by connecting two ResNet units of equal width and height based on YOLOv3. Two data enhancement strategies and distance metrics are proposed in [12] for improving the detection accuracy of small objects. Zhang et al. [13] proposed a spatial logical aggregation network (SLA-NET) with morphological transformations, which extracts fine-grained features of small objects through multiple plug-and-play dynamic fusion modules. The authors in [14] proposed a multibranch parallel feature pyramid network (MPFPN) that focuses attention on object information, improving the detection accuracy of small objects. However, the computational cost of these methods is generally large and can hardly meet the real-time requirements of many UAV applications. In addition, the detection accuracy of these methods may decrease when objects are densely distributed. Some methods have been proposed for detecting dense objects in UAV images. Xu et al. [15] proposed a foreground-enhanced attention Swin transformer (FEA-Swin) framework that integrates contextual information into the original backbone of the Swin transformer; to avoid losing the information of small objects, an improved weighted bidirectional feature pyramid network (BiFPN) is proposed, and to balance detection accuracy and efficiency, an efficient BiFPN neck is introduced. A semantic embedding density adaptive network (SDANet) was proposed in [16], which designs a new density matching algorithm to obtain each object by partitioning the clustering proposal and performing hierarchical and recursive matching of the corresponding centers. Ye et al. [17] proposed a backbone network utilizing involution and self-attention, capable of extracting effective features from complex objects; furthermore, they introduced a multiscale feature fusion module to address the large number of small objects in UAV images through multiscale object detection and feature fusion. However, due to the large number of parameters involved, these methods can hardly meet the real-time requirements of many UAV applications.
In addition, many methods have been proposed for real-time object detection in UAV images [18], [19], [20], [21], [22]. Zhang et al. [18] achieved real-time object detection for UAVs by introducing channel-level sparsity in the convolutional layers, applying L1 regularization to the channel scaling factors and pruning less informative feature channels. The authors in [21] achieved real-time object detection by replacing standard convolution with depthwise separable convolution based on YOLOv3. A convolutional multihead self-attention (CMHSA) based on an efficient convolutional transformer block (ECTB) was proposed in [22] for real-time object detection; CMHSA employs a convolutional projection instead of a position-wise linear projection, which reduces computational cost. Although these methods reduce the computational cost of object detection in UAV images, they often result in low detection accuracy, particularly for dense and small objects.
To address these problems, we investigate efficient mechanisms (DS-YOLOv5s) for detecting dense and small objects in UAV images with high accuracy and low computational cost. Fig. 1 summarizes the performance of our method compared with state-of-the-art methods.
The major contributions of our work are as follows. 1) Kernel K-means is used to obtain optimal anchors for dense and small object detection. 2) A spatial information enhancement (SIE) module is proposed to improve the detection accuracy of dense objects by extracting object spatial location information. 3) A Coord_C3 module is proposed to improve the receptive field of the network and to reduce the number of network parameters. 4) A small detection head is added in the Head of the network and skip connections are employed in the Neck of the network to improve the detection accuracy of small objects. The rest of this article is organized as follows: Section II describes the related work. Section III provides a detailed description of the proposed method for dense and small object detection. The experimental results are presented and analyzed in Section IV. Further analysis is discussed in Section V. Finally, Section VI concludes this article.

II. RELATED WORK
As our method is based on YOLOv5s, in this section, we will first introduce the principles of YOLOv5s, then describe some related work in feature extraction for object detection in UAV images.

A. YOLOv5s
YOLOv5 [23] is the fifth generation of You Only Look Once (YOLO) [24], a state-of-the-art object detection network. YOLOv5 includes YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which are distinguished mainly by their size and computational complexity. Compared to the other variants, YOLOv5s has the advantages of fewer parameters and faster detection speed. The structure of YOLOv5s includes the Input, Backbone, Neck, and Head, as shown in Fig. 2.
Input: During the initial stage of processing, the input image is adjusted to conform to the model's input size through normalization and adaptive scaling. The Mosaic data augmentation [25] and adaptive anchor [26] methods are used to improve the inference speed of the network and to enhance its robustness.
Backbone: The Backbone of YOLOv5s is the improved CSPDarknet53 network [27], which combines CBS, C3, and SPPF [28] modules for refined feature extraction. CSPDarknet53 effectively enhances the learning capability of convolutional neural networks (CNNs) while simultaneously reducing computational cost.
Neck: It connects the Backbone with the prediction Head, facilitating the acquisition and transmission of feature information. It consists of two networks: the feature pyramid network (FPN) [29], which has an up-down structure that upsamples and fuses the underlying feature information to obtain the predicted feature map, and the path aggregation network (PANet) [30], which uses a down-up structure to fuse the FPN feature maps and complement the FPN structure.
Head: YOLOv5s contains three object detection heads, which correspond to three different sizes of feature maps. Each grid cell on a feature map is predefined with three anchors of different aspect ratios, which are used to store anchor-based position and classification information in the feature map channel dimension for object prediction and regression. Predicted boxes are calibrated by the CIoU loss [31], and the optimal predicted boxes are obtained by nonmaximum suppression (NMS) [32].
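As a concrete illustration of this post-processing step, the following minimal sketch uses torchvision.ops.nms to suppress overlapping candidate boxes. The boxes, scores, and IoU threshold here are toy values for illustration; the paper does not state the exact threshold used.

```python
import torch
from torchvision.ops import nms

# Three candidate boxes in (x1, y1, x2, y2) format; the first two overlap heavily.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.7])

# Keep the highest-scoring box in each overlapping group. 0.45 is a common
# YOLO default threshold, used here only as an illustrative assumption.
keep = nms(boxes, scores, iou_threshold=0.45)
print(keep)  # tensor([0, 2]) -- the second box is suppressed by the first
```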
In this study, YOLOv5s serves as both the basis and the benchmark. We made various enhancements to YOLOv5s to address the accuracy and real-time requirements for detecting dense and small objects in UAV images.

B. Feature Extraction for Object Detection
Traditional convolutional networks use an up-down structure, where the expressiveness of an object's shallow features decreases as the convolutional layers become deeper. As the semantic information of smaller objects often appears in shallower feature maps, deeper convolutional layers may cause object feature information to be weakened or lost entirely. Common methods for addressing this problem include feature fusion, receptive field enhancement, and anchor matching.
Feature fusion is widely used in object detection. Jin et al. [33] combined the semantic information of feature context at different scales with a dilated convolution method. Han et al. [34] used deconvolution to enhance the feature representation of ship objects. Wang et al. [35] designed SSS-YOLO to fuse feature and semantic information through a path-enhanced fusion network. A two-way convolution network (TWCNet) is proposed in [36] to process both shallow and deep feature information. The authors in [37] proposed a graph feature-enhanced selective assignment network (GSANet) that uses graph convolutional networks to obtain topological information between ground objects to enhance representational features. For receptive field enhancement, Zhao et al. [38] grew candidate regions from multiple receptive fields and combined the contextual information of the candidate boxes to improve detection accuracy. Dai et al. [39] fused down-up and up-down feature maps to enhance the receptive fields of small object features. To expand the receptive field of the convolution kernel, Wang et al. [40] introduced large kernel convolution by replacing the small convolution kernel with two parallel rectangular convolution kernels; expanding the receptive field while maintaining the ability to capture local detailed features improves object detection accuracy. As to anchor matching, Fu et al. [41] adopted an anchor-free strategy to detect small ships in SAR images. Xu et al. [42] used an improved K-means++ algorithm to optimize the anchors and to alleviate the difficulty of optimizing multiscale ship features. Liang et al. [43] proposed a concise analytical geometry algorithm to calculate ship orientation and gradually refine the keypoints to establish an accurate oriented bounding box.

III. DS-YOLOV5S
Because UAVs fly at high altitudes, the captured images contain a high proportion of small objects, which are often densely distributed. In addition, it is often difficult to balance the high computational demand of detection against the limited computing power of the low-power chips on UAVs. To address these problems, we propose a method for detecting dense and small objects in UAV images based on YOLOv5s. First, the anchors of the dataset are optimized by the kernel K-means clustering algorithm. Then, we introduce an SIE module in the Backbone and Neck of the network, which enhances the location information of dense objects by extracting spatial feature information. Also in the Backbone and Neck, the Coord_C3 module, based on CoordConv [44], is proposed to replace the C3 module of YOLOv5s, which improves the receptive field of the network and reduces its number of parameters. In addition, in the Head of the network, a detection head for small objects is added to improve the detection accuracy of small objects. Finally, skip connections are introduced to fuse shallow and deep features, which improves the feature sensing ability of the network and further improves the detection accuracy of dense and small objects. The DS-YOLOv5s network structure is shown in Fig. 3.

A. Anchor Optimization
YOLOv5s defines three initial anchors and optimizes them using the K-means clustering algorithm [45]. However, K-means requires manual initialization of the clustering centers, which results in low clustering accuracy. Existing algorithms use K-means++ [46] to avoid manual initialization of the cluster centers, but this results in higher computational complexity.
Dhillon et al. [47] proposed kernel K-means based on K-means. Unlike traditional K-means, kernel K-means uses kernel functions to map data into a high-dimensional space before clustering. This algorithm can effectively process nonlinearly distributed datasets and improve clustering accuracy.
Ensuring reasonable anchors is a crucial requirement for improving object detection accuracy. A UAV object detection dataset usually contains multiple categories of objects, each of which has a different size and a nonlinear distribution. To obtain optimal anchors, this study introduces kernel K-means clustering at the Input of the DS-YOLOv5s network to automatically find reasonable anchors for object detection. Fig. 4 illustrates an example where the number of clustering centers is set to 5 and the number of iterations is set to 10. We conducted experiments on simulated data, with the Rand index serving as the evaluation metric. The clustering process using K-means, K-means++, and kernel K-means took 0.18, 0.21, and 0.12 s, respectively. Notably, kernel K-means achieved the highest Rand index, indicating superior performance in terms of both speed and accuracy. Consequently, this study utilizes kernel K-means for anchor optimization. To the best of our knowledge, this is the first application of kernel K-means to anchor optimization.
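To make the anchor-clustering step concrete, the following is a minimal NumPy sketch of kernel K-means with an RBF kernel applied to ground-truth box (width, height) pairs. The kernel choice, the bandwidth heuristic, and the toy data are our assumptions; the paper does not specify these details.

```python
import numpy as np

def kernel_kmeans(X, k, gamma=None, iters=10, seed=0):
    """Minimal kernel K-means with an RBF kernel (illustrative sketch).

    X: (n, 2) array of box (width, height) pairs; k: number of anchors.
    Returns hard cluster labels; anchor sizes are then taken as the mean
    (width, height) of each cluster in the original space.
    """
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    if gamma is None:
        gamma = 1.0 / (np.median(sq) + 1e-12)            # median bandwidth heuristic
    K = np.exp(-gamma * sq)                              # RBF kernel matrix

    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)                  # random initial assignment
    for _ in range(iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue
            # Squared distance to the cluster centre in feature space (kernel trick):
            # K(i,i) - 2/|c| * sum_j K(i,j) + 1/|c|^2 * sum_{j,l} K(j,l)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        labels = dist.argmin(axis=1)
    return labels

# Hypothetical usage: wh is an (n, 2) array of ground-truth box sizes.
wh = np.abs(np.random.default_rng(1).normal([30.0, 40.0], [8.0, 10.0], size=(200, 2)))
labels = kernel_kmeans(wh, k=9)
anchors = np.array([wh[labels == c].mean(axis=0)
                    for c in range(9) if (labels == c).any()])
```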

B. SIE Module
An SIE module is proposed to extract spatial location features in UAV images, weight different channel features and spatial locations, and enhance the network's ability to perceive and localize object categories and location distributions. The SIE module structure is shown in Fig. 5. The location information of the feature maps is extracted through CoordConv [44], and the feature expression capability of the network is enhanced through the CoordCBS module. To enhance computational efficiency and improve feature representation, the feature maps are compressed in the channel dimension using a 1 × 1 convolution (Conv); this effectively removes redundant channel features and reduces the number of parameters. Subsequently, the resulting feature maps are concatenated with their max-pooled and globally average-pooled versions. The SIE module is expressed as

$$I_2 = \mathrm{ca}\big(\mathrm{MP}_{5,9,13}(\mathrm{c}(I_1)),\ \mathrm{c}(I_1),\ \mathrm{AP}(\mathrm{c}(I_1))\big)$$

where c, cc, ca, AP, and MP denote Conv, CoordConv, Concat, AvgPool, and MaxPool, respectively, and $I_1$ and $I_2$ denote the input feature map (after the CoordConv-based location extraction) and the output, respectively.
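The following PyTorch sketch shows one plausible reading of the SIE module and its equation: coordinate channels, a 1 × 1 compression, 5/9/13 max pooling, and a global average pool, all concatenated channel-wise. The channel sizes and the exact CoordCBS layout are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddCoords(nn.Module):
    """Append normalized x/y coordinate channels (the CoordConv idea [44])."""
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([x, xs, ys], dim=1)

class SIE(nn.Module):
    """Sketch of the SIE module as we read Fig. 5 and the equation above."""
    def __init__(self, c_in, c_mid):
        super().__init__()
        self.coord = AddCoords()
        # "CoordCBS": Conv-BN-SiLU over the coordinate-augmented input,
        # with a 1x1 convolution compressing the channel dimension.
        self.compress = nn.Sequential(
            nn.Conv2d(c_in + 2, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        # Stride-1 max pools with 5/9/13 kernels keep the spatial size.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)
        )

    def forward(self, x):
        y = self.compress(self.coord(x))
        # Global average pooling, broadcast back to the feature-map size.
        gap = F.adaptive_avg_pool2d(y, 1).expand_as(y)
        return torch.cat([p(y) for p in self.pools] + [y, gap], dim=1)

# Shape check: (1, 64, 32, 32) -> (1, 160, 32, 32), i.e., 5 concatenated 32-ch maps.
out = SIE(64, 32)(torch.randn(1, 64, 32, 32))
```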

C. Coord_C3 Module
To improve the receptive field and reduce the number of parameters of the network, we propose the Coord_C3 module with spatial information based on CoordConv, as shown in Fig. 6.
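Below is a minimal sketch of a C3 block whose convolutions are replaced by CoordConv, in the spirit of Fig. 6. The internal layout (two 1 × 1 branches, n coordinate-aware 3 × 3 bottleneck convolutions, and a fusing 1 × 1 convolution) follows the standard YOLOv5 C3 structure and is our assumption; the paper's exact Coord_C3 layout may differ.

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """Convolution over an input augmented with normalized coordinate channels."""
    def __init__(self, c1, c2, k=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c1 + 2, c2, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c2),
            nn.SiLU(),
        )
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

class Coord_C3(nn.Module):
    """CSP-style C3 block built from CoordConvs (illustrative sketch)."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2                              # hidden channels, as in YOLOv5's C3
        self.cv1 = CoordConv(c1, c_, 1)           # main branch
        self.cv2 = CoordConv(c1, c_, 1)           # shortcut branch
        self.m = nn.Sequential(*(CoordConv(c_, c_, 3) for _ in range(n)))
        self.cv3 = CoordConv(2 * c_, c2, 1)       # fuse the two branches
    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))

# Shape check: (1, 64, 32, 32) -> (1, 128, 32, 32).
out = Coord_C3(64, 128, n=1)(torch.randn(1, 64, 32, 32))
```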

D. Loss Function
EfficiCLoss is a loss function for object detection that combines the advantages of the class balanced importance sampling (CBIS) loss and the focal loss [48]. Its purpose is to improve the detection accuracy of small objects and to speed up the convergence of the network. Compared to DIoU [31] and SIoU [49], EfficiCLoss is significantly more effective at improving the model's detection accuracy for small objects and at accelerating training. Consequently, we employ EfficiCLoss as the loss function of our network. EfficiCLoss combines the two terms as

$$L_{\mathrm{effici}} = \alpha L_{\mathrm{cbis}} + (1 - \alpha) L_{\mathrm{focal}}, \qquad L_{\mathrm{focal}} = -(1 - p_t)^n \log(p_t)$$

where $L_{\mathrm{effici}}$ is the EfficiCLoss, $L_{\mathrm{cbis}}$ is the CBIS loss (a class-balanced weighting of the CE loss $L_{\mathrm{ce}}$), $L_{\mathrm{focal}}$ is the focal loss, $\alpha$ is a balance coefficient, $p_t$ is the predicted probability, and $n$ is a balance factor. In this article, $\alpha$ and $n$ are set to 0.6 and 0.5, respectively.
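The sketch below mirrors the formulation above: a weighted combination of a class-balanced CE term (our reading of the "CBIS loss") and a focal term. It is an interpretation, not the authors' code; only the values α = 0.6 and n = 0.5 are taken from the text, and the class weights are a hypothetical input.

```python
import torch
import torch.nn.functional as F

def effici_c_loss(logits, targets, class_weights, alpha=0.6, n=0.5):
    """Hedged sketch of an EfficiCLoss-style classification loss."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # L_ce per sample
    p_t = torch.exp(-ce)                                     # prob of the true class
    l_cbis = (class_weights[targets] * ce).mean()            # class-balanced CE term
    l_focal = ((1.0 - p_t) ** n * ce).mean()                 # -(1-p_t)^n * log(p_t)
    return alpha * l_cbis + (1.0 - alpha) * l_focal

# Hypothetical usage with uniform class weights (inverse-frequency weights
# would be the natural choice for a class-balanced term):
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = effici_c_loss(logits, targets, torch.ones(10))
```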

IV. EXPERIMENTS AND ANALYSIS
To evaluate the performance of the proposed method, this study uses the VisDrone-2019 [50], LEVIR-ship [51], and Stanford Drone [52] datasets and compares the proposed method with existing object detection methods. All experiments are conducted with the same hardware and software environment, whose configurations are shown in Table I. Our network is implemented using PyTorch on a Windows 11 operating system, with an Nvidia GeForce GTX 1660S GPU and CUDA 11.3. Stochastic gradient descent [53] is employed as the optimizer, with an input image size of 512 × 512 pixels. The network is trained for 300 epochs, including 50 epochs of freeze training with a batch size of 8; subsequently, unfreeze training is conducted with a batch size of 4. To enhance the training process, the Mosaic [25] data augmentation strategy is employed. The initial learning rate is set to 0.01, with a minimum learning rate of 0.0001, adaptively adjusted based on the dataset characteristics. The momentum parameter and weight decay are set to 0.937 and 0.0005, respectively. During the testing stage, a postprocessing step utilizing NMS is applied.
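For reproducibility, the reported optimizer settings translate directly into PyTorch as follows. The cosine annealing schedule is an assumption, since the paper states only the initial and minimum learning rates; the model here is a placeholder.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the DS-YOLOv5s model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
# Decay the learning rate from 0.01 toward the reported minimum of 0.0001
# over 300 epochs (cosine annealing assumed).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=300, eta_min=0.0001)

for epoch in range(300):
    # ... forward/backward passes over the training set go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```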

A. Datasets

1) VisDrone-2019:
The VisDrone-2019 dataset was compiled by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. This benchmark dataset comprises 288 video clips, with 261 908 frames and 10 209 static images. These recordings were captured using various drone-mounted cameras, offering a comprehensive representation of different aspects, including location (spanning 14 different cities across China, separated by thousands of kilometers), environment (urban and rural settings), objects (pedestrians, vehicles, bicycles, etc.), and scene density (ranging from sparse to crowded scenes). The dataset was collected using diverse drone platforms, with varying models, across different scenarios, and under various weather and lighting conditions. Manual annotation of these frames produced over 2.6 million bounding boxes encompassing objects of interest, such as pedestrians, cars, bicycles, and tricycles. The distribution of features within the dataset is shown in Fig. 7. We employed this dataset to evaluate a method's capability in detecting dense objects.
2) LEVIR-Ship: The LEVIR-ship dataset comprises images captured by multispectral cameras on the Gaofen-1 and Gaofen-6 satellites. These images have a spatial resolution of 16 m and use only the R, G, and B bands. A total of 85 scenes were collected, with pixel resolutions ranging from 10 000 × 10 000 to 50 000 × 20 000. The original images were cropped to generate 1973 positive samples and 1923 negative samples, all of size 512 × 512 pixels. Detecting ships within this dataset is challenging because of their small size relative to the vast background. LEVIR-ship is a widely used dataset for object detection in remote sensing images, and its data distribution closely resembles that of UAV images. The distribution of features within the dataset is shown in Fig. 8. We utilized this dataset to evaluate a method's capability in detecting small objects.
3) Stanford Drone: The Stanford Drone dataset is an outdoor UAV dataset collected by the Computational Vision and Geometry Lab in the Department of Computer Science at Stanford University, containing images and videos of various types of targets (not only pedestrians but also bicycles, skateboards, cars, buses, and golf carts). The dataset collects trajectory interaction information for about 20 000 objects in eight different scenes, captured by UAVs from an overhead view during crowded periods on campus; each object track is labeled with a unique ID, making the dataset suitable for object trajectory prediction and multiobject tracking. The number of videos in each scene and the percentage of each agent type in each scene are shown in Fig. 9. We utilized this dataset to evaluate a method's capability in detecting dense and small objects.

C. Evaluation Metrics
As primary accuracy evaluation metrics, standard average precision (AP), mean average precision (mAP), precision (P), and recall (R) are widely used in object detection for both natural and remote sensing images.To assess a model's efficiency, we measure frames per second (FPS), parameters (Params), and floating-point operations (FLOPs).
In the context of a model or classifier, the metric P represents the ability to accurately predict positive samples, with a higher value indicating superior performance. R represents the proportion of actual positive samples that are correctly predicted, and a higher value likewise indicates better performance. It is worth noting that P and R influence each other: generally, when P is high, R tends to be low, and vice versa. The metrics P and R are computed as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

TABLE II ABLATION EXPERIMENTS ON VISDRONE-2019 (%)
where TP denotes correctly recognized objects in the image, FP denotes incorrectly recognized objects (false detections), and FN denotes objects present in the image that the model failed to recognize.
AP refers to the area under the P-R curve, while mAP represents the average of AP over all categories. Specifically, mAP@0.5 denotes the mAP when the intersection over union (IoU) threshold is set to 0.5, while mAP@0.5:0.95 indicates the mAP averaged across IoU thresholds from 0.5 to 0.95 in steps of 0.05 (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). The metrics AP and mAP are computed as

$$AP = \int_0^1 P(K)\, dR(K), \qquad mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i$$

where C denotes the number of object categories, K represents the current IoU threshold, and P(K) and R(K) denote the precision and recall at the current IoU threshold, respectively. FLOPs serve as a metric for quantifying the computational complexity of a model and are frequently employed as an indirect indicator of the speed of a neural network model. Params represents the number of model parameters, and FPS provides a measure of the runtime efficiency of a model.
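The snippet below shows how AP can be computed from paired precision/recall points using the standard all-points interpolation (area under the precision envelope of the P-R curve). It is a generic illustration with toy values, not the evaluation code used in the paper.

```python
import numpy as np

def average_precision(precision, recall):
    """Area under the P-R curve via all-points interpolation:
    make precision monotonically non-increasing, then integrate over recall."""
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(((r[idx + 1] - r[idx]) * p[idx + 1]).sum())

# Toy P-R points (precision and recall correspond pairwise):
prec = np.array([1.0, 1.0, 0.67, 0.75])
rec = np.array([0.25, 0.5, 0.5, 0.75])
ap = average_precision(prec, rec)  # ~0.69 for these toy values

# mAP is the mean of per-class APs; mAP@0.5:0.95 additionally averages
# this over IoU thresholds 0.5, 0.55, ..., 0.95.
```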

D. Dense Object Detection

1) Ablation Studies:
To assess the effectiveness of DS-YOLOv5s in enhancing dense object detection performance in UAV images, a series of ablation experiments was conducted on the VisDrone-2019 dataset; the results are shown in Table II. The findings reveal that introducing SIE on top of YOLOv5s resulted in a 0.3% increase in mAP@0.5 and a 0.8% increase in mAP@0.5:0.95, effectively improving both feature extraction capability and detection accuracy. In addition, incorporating Coord_C3 into YOLOv5s yielded notable improvements in both mAP@0.5 and mAP@0.5:0.95; in particular, there was a 2% increase in mAP@0.5:0.95, significantly enhancing the accuracy of dense object detection. Similarly, including EfficiCLoss in YOLOv5s led to improvements in mAP@0.5 and mAP@0.5:0.95, indicating improved detection accuracy. DS-YOLOv5s denotes the comprehensive approach that incorporates all three proposed improvements. The results clearly demonstrate the superior detection accuracy of the proposed method for dense objects in UAV images.

2) Comparison With State-of-the-Art Methods:
The training process was conducted for a total of 300 epochs, and the corresponding loss curves are illustrated in Fig. 10. In this figure, Box_loss represents the discrepancy between the predicted and ground-truth bounding boxes. Class_loss indicates the classification loss, which reflects the model's ability to accurately recognize objects in the image and assign them to the correct categories. Object_loss represents the confidence loss, which supervises the presence of objects within each grid cell and calculates the network's confidence level. As shown in Fig. 10, after 300 epochs the loss values of the DS-YOLOv5s network ceased to decrease, indicating that the network had converged and stabilized.
DS-YOLOv5s was compared with other object detection methods on the VisDrone-2019 dataset. As shown in Table III, on mAP@0.5, DS-YOLOv5s improves upon Faster RCNN by 36.3%, Mask RCNN by 16.8%, Cascade RCNN by 4.1%, CornerNet by 22.5%, CenterNet by 1.75%, SSD by 30.4%, YOLOv2 by 28%, YOLOv3 by 4%, YOLOv4 by 5.7%, YOLOv5s by 3.1%, EfficientDet by 2.6%, SGMFNet by 2.8%, YOLOv7-sea by 2%, and RTD-Net by 3.5%. On Params, YOLOv5s has the smallest number of parameters, and our method is second only to YOLOv5s. On standard deviation, DS-YOLOv5s has the smallest standard deviation of AP, which demonstrates the stability of the method. Due to the hardware limitations of UAVs, small platforms are more sensitive to the number of parameters and model volume. DS-YOLOv5s effectively reduces the number of parameters and the volume of the network by adopting the idea of CoordConv and replacing the traditional convolution modules of YOLOv5s with CoordConv modules. It can be seen from Table III and Fig. 11 that DS-YOLOv5s greatly reduces the hardware requirements of the network structure while guaranteeing accuracy, which is conducive to its use on small devices, such as UAVs.
3) Analysis of Visualization Results: To more intuitively demonstrate the dense object detection capability of DS-YOLOv5s in UAV images, we selected some experimental results, shown in Fig. 13, which demonstrate that the proposed method can accurately determine the position of each object among dense objects. This is a challenging scenario, as an algorithm can easily misidentify such objects as a single object or miss some of them. DS-YOLOv5s can effectively detect each object in these scenes.

E. Small Object Detection

1) Ablation Studies: As shown in Table VI, compared to YOLOv5s, DS-YOLOv5s exhibits improvements of 1.8% in mAP@0.5 and 9% in mAP@0.5:0.95 on the LEVIR-ship dataset. These findings validate the effectiveness of the proposed modules.
2) Comparison With State-of-the-Art Methods: Following 300 epochs of training, a test was conducted using sample data from the test dataset. The area under the P-R curve serves as a measure of average accuracy, with a larger area indicating higher accuracy. As shown in Fig. 14, the P-R curve of DS-YOLOv5s encloses a larger area than that of YOLOv5s, indicating superior performance.
As shown in Table VIII and Fig. 16, and in comparison with Table VII, the accuracy on the Stanford Drone dataset is higher than on the LEVIR-ship dataset, particularly in terms of mAP@0.5:0.95.

3) Analysis of Visualization Results: To assess the generalization capability of DS-YOLOv5s in detecting small objects, we conducted comparative experiments using the LEVIR-ship dataset. As shown in Fig. 17, DS-YOLOv5s is able to accurately detect the smallest objects with almost no missed or false detections.

F. Computational Complexity
Table IX presents the FPS achieved by each object detection method on the VisDrone-2019, LEVIR-ship, and Stanford Drone datasets using the same experimental platform. The results reveal that Faster-RCNN exhibits the lowest FPS due to its two-stage architecture: the object region is extracted first, followed by CNN classification and identification, which yields higher detection accuracy but can hardly meet the real-time requirement. On the other hand, YOLOv5s achieves the highest FPS but with lower accuracy. Overall, the results demonstrate that DS-YOLOv5s offers both high detection accuracy and fast detection speed, which shows that it is a promising method for detecting dense and small objects in UAV applications.

V. DISCUSSION
With the popularity and development of UAV technology, UAVs are widely used in military and civilian applications. However, due to the high flight altitude and large field of view of UAVs, the captured images contain both dense and small objects, which reduces object detection accuracy. It is therefore crucial to develop a method that can simultaneously detect dense and small objects in UAV images.
Existing UAV object detection methods are mainly object-specific and have low accuracy for detecting dense and small objects in images. In addition, existing methods suffer from large numbers of parameters and low computational efficiency, which make them difficult to deploy on UAV computing platforms for real-time object detection. Our proposed method achieves real-time object detection while guaranteeing high-accuracy detection of dense and small objects. First, the kernel K-means clustering algorithm is used to optimize the anchors of the dataset. Then, the SIE module is introduced in the Backbone and Neck of the network to enhance the location information of dense objects by extracting spatial feature information. In the Backbone, the Coord_C3 module, based on CoordConv, is proposed to replace the C3 module of YOLOv5s, which improves the receptive field of the network and reduces its number of parameters. In addition, in the Head of the network, a detection head for small objects is added to improve the detection accuracy of small objects. Finally, skip connections that fuse shallow and deep features improve the feature sensing ability of the network, further improving the detection accuracy of dense and small objects.
It should be noted that the proposed model is still too large to be deployed on some UAVs, and the detection performance also needs to improve considering the complexity of real applications. Future work includes: 1) proposing a UAV image object detection method for hazy weather to improve robustness in complex environments; 2) proposing a dynamic object detection method for UAV images to achieve real-time object tracking; and 3) improving the computational resources of UAV computing platforms.

VI. CONCLUSION
This article focuses on improving the detection accuracy of dense and small objects in UAV images by incorporating feature fusion and spatial information. To this end, we have proposed DS-YOLOv5s. First, we introduce the kernel K-means algorithm at the Input of the network, enabling rapid determination of the optimal anchor sizes. To effectively extract spatial information from the feature maps, an SIE module is proposed. Furthermore, the Coord_C3 module is introduced to extend the range of feature awareness and to reduce the model size. In addition, skip connections are employed to fuse shallow detail information with deep semantic information, thereby enhancing the network's feature perception. Finally, we incorporate a small object detection head into the network architecture to improve the detection of small objects. Experimental results demonstrate that DS-YOLOv5s surpasses existing state-of-the-art methods in terms of both detection accuracy and FPS.
As future work, we plan to evaluate our method with more challenging datasets and design a lightweight network to reduce the size of the network.

Fig. 1. FPS and AP comparison of object detection methods on VisDrone-2019.

Fig. 7. Distribution of objects in VisDrone-2019. (a) Distribution of dataset categories. (b) Distribution of object sizes in the dataset.

Fig. 8. Distribution of objects in the LEVIR-ship dataset. (a) Images of the LEVIR-ship dataset. (b) Distribution of object sizes in the dataset.

Fig. 17. Detection results of small objects on LEVIR-ship (red rectangular boxes indicate detected small objects).

TABLE IV COMPARISON OF OBJECT DETECTION PERFORMANCE OF VARIOUS CATEGORIES ON VISDRONE-2019 (mAP@0.5 (%))

The experimental results show that DS-YOLOv5s can significantly improve the detection accuracy of the network for dense objects. In addition, to further demonstrate the effectiveness of the proposed method, we conducted a comparative test with state-of-the-art methods on the Stanford Drone dataset. As shown in Table V, on AP, DS-YOLOv5s improves upon Faster RCNN by 11.9%, Mask RCNN by 7.3%, Cascade RCNN by 3.6%, CornerNet by 13%, CenterNet by 5.9%, SSD by 9.3%, YOLOv2 by 6.5%, YOLOv3 by 14.1%, YOLOv4 by 12.4%, YOLOv5s by 11.4%, EfficientDet by 7.9%, SGMFNet by 8.6%, YOLOv7-sea by 8.1%, and RTD-Net by 9%. It can be seen from Table V and Fig. 12 that DS-YOLOv5s greatly reduces the hardware requirements of the network structure while guaranteeing accuracy.

TABLE V COMPARISON RESULTS OF DIFFERENT METHODS ON STANFORD DRONE

TABLE VI ABLATION EXPERIMENTS ON LEVIR-SHIP

TABLE VII COMPARISON OF ACCURACY OF DIFFERENT METHODS ON LEVIR-SHIP

TABLE VIII COMPARISON OF ACCURACY OF DIFFERENT METHODS ON STANFORD DRONE FOR SMALL OBJECTS

TABLE IX COMPARISON OF FPS ON VISDRONE-2019, LEVIR-SHIP, AND STANFORD DRONE DATASETS