Consumer-Centric Insights Into Resilient Small Object Detection: SCIoU Loss and Recursive Transformer Network

Le Wang, Yu Shi, Guojun Mao, Member, IEEE, Fayaz Ali Dharejo, Senior Member, IEEE, Sajid Javed, and Moath Alathbah

Abstract—As an emerging consumer electronic product, the unmanned aerial vehicle (UAV) has received growing attention and favor for a variety of tasks in the enterprise and individual consumer electronics markets in recent years. Deep-neural-network-based object detectors are convenient to embed into UAV products; however, drone-captured images pose the challenges of object occlusion, large scale differences and complex backgrounds to these methods, because they are not designed for detecting the small and tiny objects found in aerial images. To address this problem, we propose an improved YOLO paradigm called SR-YOLO, with an Efficient Neck, Shape CIoU and Recursion Bottleneck Transformer, for better object detection performance in consumer-level UAV products. First, an efficient neck structure is presented to retain richer features through a small object detection layer and an up-sampling operator suited to small object detection. Second, we design a new prediction box loss function called shape complete-IoU (SCIoU), which uses a width (height) limiting factor to remedy the deficiency that CIoU focuses only on the aspect ratio, taking into account both the aspect ratio and the ratio of the two boxes' widths. Moreover, combining a recurrent structure with the multi-head self-attention mechanism in a cyclic manner, a recursive bottleneck transformer is constructed to relieve the impact of the highly dense scenes and occlusion problems that exist in UAV images. We conduct extensive experiments on two public datasets, VisDrone2019 and TinyPerson, where the results show that the proposed model surpasses the compared YOLO baseline by 8.1% and 3.2% in $mAP_{50}$, respectively. In addition, the analysis and case study also validate SR-YOLO's superiority and effectiveness.

Index Terms—Unmanned aerial vehicle (UAV) image, small object detection, you only look once, bottleneck transformer.

I. INTRODUCTION
Unmanned aerial vehicles (UAVs) with object detection have permeated various fields of consumer electronics [1], [2], [3], such as industrial inspection [4], environmental monitoring of aquaculture [5], agricultural crop monitoring [6], and civilian mountain search and rescue [7]. Although UAV aerial images are of high resolution, they still suffer from dramatic scale changes, intricate backgrounds and occlusion. Currently, the YOLO [8], [9], [10], [11] series is gaining importance among object detection algorithms. YOLOv5, with its high adaptability, easy deployment and high accuracy, is widely used in various detection tasks and in industrial production. However, the YOLO series was conceived for object detection in general natural scenes (typical natural scene datasets include MS COCO [12], PASCAL VOC [13], etc.). To keep YOLOv5 lightweight while better adapting it to UAV object detection, we improve YOLOv5 for the characteristics of images in actual UAV working scenarios, so that it can give consumers a better experience.
For the above reasons, we propose an improved efficient neck, a newly designed Shape CIoU and a Recursion Bottleneck Transformer for small object detection in drone-captured images. First, considering that small or tiny objects possess few pixels [14], [15] and that there is dramatic scale variation between different objects in aerial images, a new neck structure called the Efficient Neck is built by integrating the Content-Aware ReAssembly of FEatures (CARAFE) [16] up-sampling operator and a small object detection layer. The Efficient Neck can better cope with the drastic changes in object scale and obtain better detection results, such as detecting more tiny objects. Second, we design a novel prediction box loss function, Shape Complete-IoU (SCIoU), to bring the predicted box closer to the real box; a newly designed shape penalty term remedies the failure of the aspect-ratio penalty term in complete-IoU (CIoU) [17]. In addition, an improved Recursion Bottleneck Transformer (RBoT) module is proposed to obtain global information by recursively applying the self-attention mechanism, alleviating object occlusion in crowds and highly dense scenes.
By incorporating the advanced components above, an innovative YOLO paradigm is derived to address the specific challenges of small and even tiny object detection by enhancing the spatial resolution of the feature maps, refining the aspect-ratio loss and obtaining global contextual signals. Specifically, to validate the proposed method, we implement two variants of SR-YOLO, namely SR-YOLOv5 and SR-YOLOv8 (based on YOLOv5 and YOLOv8, respectively), and conduct extensive experiments on two benchmark datasets (VisDrone2019 and TinyPerson). The experimental results show that SR-YOLOv5 surpasses the baseline by 8.1% in $mAP_{50}$, making the new model more suitable for small object detection tasks than the original YOLOv5. The experiments also demonstrate that SR-YOLO recognizes targets more accurately across a variety of complex backgrounds and poor lighting environments. Combined with a UAV intelligent system, SR-YOLO can provide better intelligent identification services for electronics consumers. In summary, the main innovations of this paper are as follows:
1) A new neck network comprising CARAFE, an additional feature fusion layer and a detection head is realized, where CARAFE provides a large receptive field and flexible up-sampling. The four-detection-head structure better alleviates the impact of object scale changes.
2) A new prediction box loss function is designed, which brings the predicted box closer to the real box while improving the width ratio between them. The new loss function enables the network to obtain higher-quality predicted boxes and maximize detection accuracy.
3) An improved recursion Transformer is proposed to enhance recognition in occluded and dense scenes. This new attention mechanism helps the network better obtain target context information in complex situations.

II. RELATED WORK
Object Detection Models: Generally, object detection models take an image as input and output bounding boxes and labels for the detected objects; they can be mainly divided into two classes according to their identification process. The first type comprises two-stage detectors, in which detection is separated into two phases: first generating a large number of region proposals that may contain objects together with approximate location information, and then classifying and regressing the RoIs. Examples include R-CNN [18], Fast R-CNN [19], Faster R-CNN [20], SPPNet [21], Mask R-CNN [22] and FPN [23].
Bounding Box Regression Loss Functions: In object detection, the localization task determines the position of an object in an image and outputs its coordinates. It relies on the bounding box regression module, which uses a rectangular bounding box to predict the object's position and then continuously adjusts the position of the predicted box. Bounding box regression loss functions have developed rapidly since the introduction of IoU (intersection over union). GIoU [32] was proposed to solve the problem that IoU cannot be optimized when the predicted box and the real box do not intersect, aiming to drive the two boxes to overlap. DIoU and CIoU [17] then further solved the degradation problem when the predicted and real boxes overlap.
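As a concrete reference for this progression, the sketch below computes IoU and the GIoU variant for axis-aligned boxes in (x1, y1, x2, y2) format. It follows the standard published definitions rather than any code from this paper; the function name and box layout are our own choices for illustration.

```python
import torch

def iou_giou(box_a: torch.Tensor, box_b: torch.Tensor):
    """IoU and GIoU for axis-aligned boxes in (x1, y1, x2, y2) format.
    A minimal sketch of the standard definitions; inputs have shape (N, 4)."""
    # Intersection rectangle
    lt = torch.max(box_a[:, :2], box_b[:, :2])   # top-left corners
    rb = torch.min(box_a[:, 2:], box_b[:, 2:])   # bottom-right corners
    wh = (rb - lt).clamp(min=0)                  # zero extent if no overlap
    inter = wh[:, 0] * wh[:, 1]

    area_a = (box_a[:, 2:] - box_a[:, :2]).prod(dim=1)
    area_b = (box_b[:, 2:] - box_b[:, :2]).prod(dim=1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C: GIoU subtracts the "wasted" fraction of C
    lt_c = torch.min(box_a[:, :2], box_b[:, :2])
    rb_c = torch.max(box_a[:, 2:], box_b[:, 2:])
    area_c = (rb_c - lt_c).prod(dim=1)
    giou = iou - (area_c - union) / area_c
    return iou, giou
```

The enclosing-box term (area_c - union) / area_c is what gives GIoU a gradient when the two boxes do not intersect, which is exactly the failure mode of plain IoU noted above.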
Consumer Electronics: The development of technology combining drones and deep learning for crop classification was investigated by Bouguettaya et al. [33], who pointed out the importance of integrating UAV-based remote sensing and deep learning algorithms, especially object detectors, which provide great convenience to consumers in agriculture. Huang et al. [34] proposed a YOLO-R-based algorithm to handle real-world road object detection for autonomous driving. Gung et al. [35] improved YOLOv4 for pavement defect detection, helping drivers obtain more information about the road surface to reduce car damage and safety risks. In addition, Chen et al. [36] applied a defect detector to enterprise production lines, where their model effectively assists workers in identifying defective products.
YOLO Series: YOLO series models are widely used in object detection because of their high accuracy and detection speed. The TPH-YOLO model proposed by Zhu et al. [37] was designed around the characteristics of aerial images: based on YOLOv5, a Transformer layer was added in front of the detection head to form a Transformer Prediction Head (TPH) replacing the original prediction head, and a Convolutional Block Attention Module (CBAM) was employed to find the areas of attention in dense object scenes, but this structure requires large computational costs and hardware resources. Rahman et al. [38] trained four different models on aerial views of urban traffic and then used non-maximum suppression (NMS) to ensemble them; although this method improves accuracy, its inference time increases greatly. Kim et al. [39] proposed ECAP-YOLO, introducing an improved IECA module into the SPP module and the network, and modified the number of detection layers to make the model more suitable for detecting small objects; because the model removes the large-object detection layer, its detection of large objects is relatively poor. Liu et al. [40] used YOLOv4 for sea surface object detection, designing a new reverse depthwise separable convolution to replace some of the convolutions in the network, which improved accuracy slightly, but replacing all the convolutions in the network proved not to be ideal. Junos et al. [41] utilized an improved YOLOv5 network to detect oil palm images from drones; this study added a densely connected neural network, changed the activation function to Swish and added a new detection layer, resulting in superior detection. By using the proposed object detection architecture SR-YOLO, promising results can be demonstrated in recognizing and classifying both normal and small, even tiny, objects in aerial images.

III. METHOD
To enable a conventional object detector to simultaneously address object occlusion, large scale differences and complex backgrounds when recognizing and classifying small, even tiny, objects in aerial images, we propose a novel SR-YOLO model built on three elaborated components, namely an Efficient Neck, Shape CIoU and Recursion Bottleneck Transformer, and implement its two variants (SR-YOLOv5 and SR-YOLOv8), whose detection performance is verified in our experiments, so that consumer-level UAVs equipped with our method can bring a better object detection experience to consumers. The three proposed components are detailed as follows.

A. Efficient Neck
There are plenty of small objects (defined in COCO [12] as objects smaller than 32×32 pixels) and tiny objects (defined in TinyPerson [42] as objects smaller than 20×20 pixels) in aerial images. Meanwhile, objects from different classes can differ dramatically in scale. However, the neck network of YOLO falls short here; for instance, the nearest-neighbor up-sampling used in YOLOv5's neck prevents the effective propagation of high-level semantic information, and YOLOv5's three-prediction-head structure is only suitable for medium- and large-sized object detection tasks. To fill these gaps, we propose a neck structure composed of content-aware reassembly of features and a small-object path aggregation network to improve the detection of tiny objects. This structure includes a small object detection layer and CARAFE up-sampling.
On the one hand, vanilla YOLOv5 uses 8-, 16- and 32-times down-sampling, and the resulting feature maps struggle to preserve the informative features of small objects. Therefore, a small object detection layer is employed at 4-times down-sampling to predict on a larger-resolution feature map, and an additional detection head is added to alleviate the problem of drastic scale changes. On the other hand, the nearest-neighbor up-sampling employed by YOLOv5 is simple and efficient, but it only uses the gray value closest to the sampling point; as a result, it produces discontinuous gray values after sampling, yielding low image quality. With this approach it is difficult for deep networks to learn the features of smaller objects, which hurts the model's detection accuracy. For this issue, we replace the up-sampling component with one more suitable for aerial images.
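To see why the extra 4-times down-sampled head matters, consider some back-of-the-envelope arithmetic (ours, not the paper's): a TinyPerson-scale object of 20×20 pixels in a 640×640 input spans only 2.5 cells per side on the stride-8 map, but 5 cells on the stride-4 map, giving the detector a usable footprint.

```python
# Footprint of a 20x20-pixel object on each prediction scale (640x640 input).
# Illustrative arithmetic only; strides follow the YOLOv5 convention,
# with stride 4 being the added small-object head.
for stride in (4, 8, 16, 32):
    cells = 20 / stride                  # object extent in grid cells
    print(f"stride {stride:2d}: {640 // stride:3d}x{640 // stride:<3d} map, "
          f"object spans {cells:.2f} cells per side")
```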
CARAFE reassembles the feature information of the lower-resolution input feature map to generate a higher-resolution output feature map, allowing more effective up-sampling for objects with few pixels. Combined with CARAFE, the small object detection layer can incorporate context from surrounding regions to enhance the representation of small objects. We therefore employ this content-aware up-sampling operator and a small object detection layer to construct the Efficient Neck, as shown in Figure 1. Compared with the original neck using nearest-neighbor up-sampling, the Efficient Neck with CARAFE has a larger up-sampling kernel, which allows it to summarize contextual information over a large receptive field. CARAFE performs content-aware processing for specific instances, generating adaptive kernels on-the-fly; the up-sampling kernel is semantically related to the feature map, so CARAFE can adapt to the input content. As a result, CARAFE better preserves feature information for small objects, and the Efficient Neck can better utilize the surrounding and semantic information of the feature map. Hence, an additional feature fusion layer and prediction head are added to the Efficient Neck. As opposed to the three prediction heads of the vanilla model, a four-head design better handles the drastic scale changes of different objects. The Efficient Neck has a significant impact on scenarios involving small objects whose scale changes drastically, which otherwise cause missed and false detections. We verify this in the ablation experiments in Section IV.
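For readers unfamiliar with the operator, the following PyTorch sketch illustrates the CARAFE mechanism as described in [16]: a lightweight branch predicts one softmax-normalized reassembly kernel per output pixel, which is then used to reweight that pixel's neighborhood in the low-resolution map. The hyper-parameters (c_mid=64, k_enc=3, k_up=5) are the defaults reported in the CARAFE paper and are assumptions here, since the section does not state SR-YOLO's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Minimal sketch of the CARAFE up-sampler (Wang et al. [16])."""
    def __init__(self, c, scale=2, c_mid=64, k_enc=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.comp = nn.Conv2d(c, c_mid, 1)                     # channel compressor
        self.enc = nn.Conv2d(c_mid, (scale * k_up) ** 2,       # = scale^2 * k_up^2
                             k_enc, padding=k_enc // 2)        # kernel predictor

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) kernel prediction: one k*k reassembly kernel per output pixel
        ker = self.enc(self.comp(x))                  # (b, s^2*k^2, h, w)
        ker = F.pixel_shuffle(ker, s)                 # (b, k^2, s*h, s*w)
        ker = F.softmax(ker, dim=1)                   # kernel weights sum to 1
        # 2) gather each pixel's k*k neighborhood from the low-res map
        pat = F.unfold(x, k, padding=k // 2)          # (b, c*k^2, h*w)
        pat = pat.view(b, c * k * k, h, w)
        pat = F.interpolate(pat, scale_factor=s, mode="nearest")
        pat = pat.view(b, c, k * k, s * h, s * w)
        # 3) content-aware reassembly: weighted sum over the neighborhood
        return (pat * ker.unsqueeze(1)).sum(dim=2)    # (b, c, s*h, s*w)
```

Because the kernels are predicted from the feature content itself, different instances receive different up-sampling kernels, which is the property the Efficient Neck relies on for preserving small objects.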

B. Shape CIoU
The loss function for predicted box regression in object detection generally consists of a classification loss and a bounding box regression loss. For the bounding box regression loss, the most commonly used CIoU introduces two considerations on top of GIoU: the ratio between ρ and c (as shown in Figure 2), and a consistency penalty for the aspect ratios of the real and predicted boxes. Here ρ is the Euclidean distance between the center b of the predicted box and the center b^{gt} of the real box, the dotted rectangle is the minimum bounding box containing the two boxes, and c is the diagonal length of this minimum closure. The CIoU loss is given in equations (1)-(3), where V measures the aspect-ratio consistency between the predicted box and the ground-truth box, w^{gt} and h^{gt} represent the width and height of the real box, respectively, and α is the balance factor of V:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha V \quad (1)$$

$$V = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (2)$$

$$\alpha = \frac{V}{(1 - IoU) + V} \quad (3)$$
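A minimal sketch of equations (1)-(3) in code, following the standard published CIoU formulation [17] rather than the authors' released code (the DIoU loss corresponds to dropping the final αV term):

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7):
    """CIoU loss, eqs. (1)-(3), for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_g = (gt[:, 2:] - gt[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_g - inter + eps)

    # rho^2 / c^2: squared center distance over squared enclosing diagonal
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_g = (gt[:, :2] + gt[:, 2:]) / 2
    rho2 = ((ctr_p - ctr_g) ** 2).sum(dim=1)
    c_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    c2 = (c_wh ** 2).sum(dim=1) + eps

    # V (eq. 2): aspect-ratio consistency; alpha (eq. 3): its balance factor
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_g / (h_g + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v            # eq. (1)
```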
However, as Figure 3 shows, when the two real boxes and the predicted box have equal aspect ratios, the aspect-ratio penalty fails: if w/h = w^{gt}/h^{gt}, the two arctangent terms in equation (2) coincide and V = 0, so the penalty vanishes even when the predicted box is, say, twice the size of the real one, and the model's learning capability is weakened by this loss function. To address this problem, we design the SCIoU loss function shown in equation (4). V_s is a newly designed penalty with a ratio-curbing factor X, as shown in equation (5). The ratio-curbing factor remedies the shortcoming that CIoU focuses only on the aspect ratio: with its addition, SCIoU attends to both the aspect ratio and the ratio of the two boxes' widths, allowing the model to be optimized on both terms and avoiding degradation of the loss function as much as possible.
Here the ratio-curbing factor is denoted X, as shown in equation (6). The lower the value of X, the closer the predicted box is to the real box, and the ratio of the real box's width (height) to the predicted box's width (height) tends to 1. Since CIoU acts jointly on the aspect ratio, a similar effect can still be produced if the width w is replaced by the height h. Based on the above V_s and X, the new balance factor can be expressed in a new form, as shown in equation (7).

C. Recursion Bottleneck Transformer
In UAV images it is common for many objects to be in close proximity, so that from different perspectives objects appear to overlap, forming severe occlusions. On the other hand, image quality changes caused by non-uniform environmental illumination are unavoidable during drone operations, increasing the rate of missed and false detections. The convolution structure of YOLOv5 extracts limited local signals rather than contextual information, yet global features help the model localize accurately amid dense backgrounds and heavy occlusions in aerial images; global feature information is therefore very important for small object detection. With its dynamic attention mechanism and global modeling capability, the Transformer has strong feature learning ability and can extract rich global feature information. DETR, the first transformer-based object detection model, greatly simplifies the detection process in an end-to-end fashion. Srinivas et al. [43] replaced the convolution operator in the last residual block of ResNet with a multi-head self-attention (MHSA) module, allowing the network to improve accuracy while reducing the number of parameters. Shen et al. [44] used the recursive structure of recurrent neural networks to construct an iterative transformer block and applied it to the DEiT model; their experimental results show that this simple recursive operation is capable of enhancing feature representation.
Inspired by the above work, we propose a Recursive Bottleneck Transformer (RBoT) with two variants, RBoT-a and RBoT-b, as shown in Figure 4. Specifically, RBoT-a is constructed by applying MHSA layers recursively on top of the BoT structure, while RBoT-b furnishes each subsequent MHSA layer with an additional residual connection, as shown in equations (8) and (9). Based on pilot experimental results, we use only one cycle (namely, two MHSA layers) when implementing this module. Although RBoT-b is more complex than RBoT-a, the ablation experiment in Section IV-C shows that it produces no additional parameters while slightly improving detection accuracy.
$M(X)_a$ and $M(X)_b$ represent the outputs of RBoT-a and RBoT-b, respectively. MHSA denotes the multi-head self-attention mechanism, and f^{(1×1)} is a convolution operation with a kernel size of 1×1. The MHSA layer structure is shown in Figure 5; four heads are used in RBoT, and only one is shown there for simplicity. The representation of the self-attention layer is consistent with that in BoT.
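Since equations (8) and (9) did not survive extraction, the following PyTorch sketch gives our reading of the described structure: a BoT-style block whose single MHSA layer is shared and re-applied recursively (one cycle = two passes, so the recursion adds no parameters), with variant "b" adding a residual connection at each recursive pass. The class names, the placement of the 1×1 convolutions and the outer residual are assumptions based on the BoT layout, not taken from the paper.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """BoT-style multi-head self-attention over a 2-D feature map
    (relative position encodings omitted for brevity)."""
    def __init__(self, c, heads=4):                  # c must be divisible by heads
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, x):                            # x: (b, c, h, w)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)           # tokens: (b, h*w, c)
        out, _ = self.attn(seq, seq, seq)
        return out.transpose(1, 2).reshape(b, c, h, w)

class RBoT(nn.Module):
    """Sketch of the Recursion Bottleneck Transformer: one shared MHSA layer
    applied recursively; variant 'b' adds a residual at each recursive pass."""
    def __init__(self, c, heads=4, cycles=1, variant="b"):
        super().__init__()
        self.f_in = nn.Conv2d(c, c, 1)               # the 1x1 convolution f(1x1)
        self.f_out = nn.Conv2d(c, c, 1)
        self.mhsa = MHSA2d(c, heads)                 # weights shared across passes
        self.cycles, self.variant = cycles, variant

    def forward(self, x):
        y = self.mhsa(self.f_in(x))                  # first MHSA pass
        for _ in range(self.cycles):                 # recursive re-application
            y = self.mhsa(y) + (y if self.variant == "b" else 0)
        return self.f_out(y) + x                     # BoT-style outer residual
```

A call such as RBoT(c=512) can then stand in for the bottleneck inside a C3 block, matching the RBoTC3 substitution described next.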
It is worth noting that the RBoT module can be flexibly applied to multiple versions of YOLO; in practice, we replace the bottleneck module in C3 with the RBoT module in both YOLOv5 and YOLOv8, yielding a new RBoTC3 module. The module obtains rich contextualized information and enhances the ability to capture discriminative local signals because of its strong global modeling and feature learning capabilities, making the model more effective for detection under complex backgrounds and partial occlusion. Rather than replacing multiple C3 modules with RBoTC3 simultaneously, we substitute only the last C3 in the backbone network, because the former would significantly increase the computational cost, while the latter already retains the model's convolutional translation invariance and local feature extraction ability.

IV. EXPERIMENT AND ANALYSIS
To investigate the effectiveness of the model, we conduct extensive experiments on two classic public UAV image datasets, VisDrone2019 [45] and TinyPerson [42].

A. Datasets and Experimental Setup
The VisDrone2019 dataset [45], released by Tianjin University, consists of images taken by various drones under changing weather, diverse scenes and different attitudes. These drone-captured images abound in small/tiny objects with complex environments and serious occlusion. The dataset covers ten predefined categories: pedestrian, people, car, van, bus, truck, motor, bicycle, tricycle and awning-tricycle.
The TinyPerson dataset [42], released by the University of Chinese Academy of Sciences in 2019, is the first benchmark for person detection against long-distance backgrounds. It contains two types of person, at sea and on land (sea person and earth person). Low resolution (less than 20 pixels) and great variation in aspect ratio are two main characteristics of the persons in this dataset; their pose and viewing angle are variable, and some images contain more than 200 targets.
All experiments in this paper were carried out on an A100 server with 80 GB of video memory and PyTorch 1.12.1. In the training phase, the number of epochs is set to 150 and the batch size to 16; the remaining settings are the default parameters of YOLOv5.
To accurately evaluate the performance of the algorithm, we select average precision (AP), mean average precision (mAP) and recall as the evaluation indicators of this experiment. AP reflects the detection performance on a single object category, while mAP measures the comprehensive detection performance over all categories.
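For reference, these follow the standard definitions (not restated in the extracted text): AP is the area under the precision-recall curve of one class, and mAP averages AP over the N classes; $mAP_{50}$ is computed at an IoU threshold of 0.5, while $mAP_{50:95}$ averages over thresholds from 0.5 to 0.95 in steps of 0.05.

$$AP = \int_0^1 p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$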

B. Experimental Result
In this section, we choose YOLOv5s and YOLOv8n as our baseline models and compare SR-YOLO against other state-of-the-art models on the two datasets. Specifically, SR-YOLO is compared with RetinaNet [46], CenterNet [47], CornerNet [26], YOLOv4, YOLOv5s, YOLOv5m, YOLOX-s, the latest YOLOv7-tiny [42] and the currently advanced YOLOv8 [48] on the metrics $mAP_{50}$, $mAP_{50:95}$ and recall. From Table I we find that SR-YOLO obtains higher $mAP_{50}$, $mAP_{50:95}$ and recall than all the other models without adding too many parameters. For example, compared to YOLOv5m, SR-YOLO has only 36.4% of its parameters, yet $mAP_{50:95}$ is increased by 8.14%.

TABLE II RESULTS OF DIFFERENT MODELS ON TINYPERSON
SR-YOLO has 32% fewer parameters than YOLOv8s and is 2% higher in $mAP_{50:95}$. This shows the superiority of SR-YOLO over the other algorithms on UAV aerial images, since it obtains better detection results with fewer parameters; indeed, SR-YOLO surpasses YOLOv8s while reducing the number of parameters by 32.1%. In addition, to further verify the effectiveness of the proposed model, YOLOv3-tiny, YOLOv3, YOLOX-s and YOLOv5m were selected for comparison on the TinyPerson dataset, where the experimental results in Table II show that SR-YOLO outperforms YOLOv5s by 3.2%, 1.01% and 3.5% in terms of $mAP_{50}$, $mAP_{50:95}$ and recall, respectively. In summary, the validations on the two benchmarks testify that SR-YOLO outperforms the YOLOv5s model in all metrics, proving the effectiveness of the proposed three components; SR-YOLOv8n similarly outperforms the YOLOv8n model in all metrics, demonstrating the generalizability of the proposed components.

C. Ablation Experiment
To comprehensively verify the effectiveness of the Efficient Neck, SCIoU and RBoT, we conducted a series of ablation studies using YOLOv5s as the benchmark model (denoting SR-YOLOv5 as SR-YOLO for simplicity in the following description). As shown in Table III, "E.", "S." and "R." represent our Efficient Neck, Shape CIoU and RBoT, respectively. We observe that individually adding S. and E. to the baseline (YOLOv5s) improves its performance by 0.6 and 4.8 points in $mAP_{50}$. As expected, the Efficient Neck achieves the largest gain, because its small object detection layer and up-sampling are designed to boost small object detection rates.
Adding S. and R. separately on top of this module increases the model's $mAP_{50:95}$ by 6.1% and 7.2%, respectively. Finally, after combining all the innovative parts above, SR-YOLO is 8.1%, 5.4% and 6.7% higher than the benchmark method in terms of $mAP_{50}$, $mAP_{50:95}$ and recall, respectively. These comprehensive results demonstrate the effectiveness of the proposed model, which significantly improves detection accuracy over YOLOv5s with only a 0.5728 M increase in the number of parameters.
To observe the per-class accuracy of SR-YOLO, Table IV shows the detection results for each object class. It can be seen that SR-YOLO performs best on the Car category with an $mAP_{50}$ of 56.3%, followed by Bus with 42.8%, while objects of very small size such as Bicycle and Awning-tricycle are detected more poorly, gaining only 15.9% and 13.5% $mAP_{50}$, respectively. In the following, we perform more detailed ablation experiments for each of the three components.

Efficient Neck: First, we summarize the changes in parameters and accuracy after adding the small object detection layer (P2) and CARAFE to YOLOv5. After integrating the P2 layer, the parameter count increases by 0.68 M and the accuracy is greatly improved. As shown in Table V, YOLOv5 with the Efficient Neck is 4.8% higher than the benchmark method in $mAP_{50}$.
SCIoU: To demonstrate the effect of SCIoU, we selected DIoU, CIoU and SCIoU for comparative experiments on the model with the Efficient Neck added. From Table V we can see that SCIoU improves $mAP_{50}$ by 0.5% and 0.3% compared with DIoU and CIoU, respectively.
In addition, to demonstrate the role of the ratio-curbing factor in SCIoU, we compared the loss curves of CIoU and SCIoU during training. Figure 6 shows that the proposed SCIoU has a lower loss value and converges faster than the same structure with CIoU; the final loss value of SCIoU (0.044469) is only about half that of CIoU.
RBoT: Finally, to compare the effect of the recursive multi-head self-attention module RBoT with the BoT module, we conducted experiments under four settings, namely no BoT (-BoT), the original BoT (+BoT), RBoT-a and RBoT-b, each together with the Efficient Neck and SCIoU. From Table VI it can be seen that after adding BoT, $mAP_{50}$ increases by 0.5% while the parameter count decreases by 0.11 M. Our proposed RBoT further reduces the parameter count relative to the original BoT, shrinking the model by 0.34 M. Meanwhile, the models equipped with RBoT-a and RBoT-b are 1.6% higher than with BoT in $mAP_{50}$, and 0.8% and 1.1% higher in $mAP_{50:95}$, respectively; recall also improves significantly. Overall, the performance of RBoT-b is slightly better than that of RBoT-a.

D. Case Study
Considering that a UAV may encounter complicated situations when working in a real environment, we selected several representative sets of UAV images from both VisDrone and TinyPerson for a case study of intricate conditions. This section analyzes the effect of the proposed model on example images from the two datasets.

VisDrone: As shown in Figure 7, the left side is the YOLOv5 benchmark method and the right side is the SR-YOLO algorithm, where the white boxes mark areas with the most obvious contrast. Group (a) shows the most common images captured by low-altitude electronic devices such as UAVs at transportation hubs, where targets overlap and occlude each other. Group (b) selects scenes with high contrast and complex backgrounds to test the model's resistance to interference. Group (c) shows traffic monitoring by UAV at night; these scenes are darker, and vehicles are blurred in the image due to high-speed movement. Group (d) shows a situation where a vehicle is obscured by a tree while the UAV is checking street parking. In group (a), it is evident that YOLOv5 fails to detect many partially obscured cars and cyclists; in contrast, the improved model detects more occluded objects, and the boxes for riders are refined from the whole two-wheeler to the person. Group (b) shows that under backlight, building shadow and complex background, the original model detects only two targets, while SR-YOLOv5 correctly detects four. In group (c), with a complex background and blurred high-speed vehicles at night, the original model identifies white objects on a roof as vehicles and moving cars as trucks, while our model correctly identifies the object categories. In group (d), the original model misses the vehicles waiting at red lights in the distance and the small two-wheelers in the middle of the road, but SR-YOLO detects both categories more accurately. Across these hard-to-distinguish cases, our proposed model has been tested under different ambient lighting, including backlight, low light and building shadows, indicating the robustness of SR-YOLO under poor lighting conditions. The model also performs excellently at traffic intersections crowded with people and vehicles. These comparisons show that the improved model is more suitable for UAV aerial image detection.
TinyPerson: The comparison between the YOLOv5 model and SR-YOLO is shown in Figure 8, with YOLOv5 on the left and the improved algorithm on the right. Sub-figures (a), (b) and (c) examine the distribution of people on a beach, with tiny figures and a large amount of useless background information; in these three groups, the original YOLOv5 can barely detect the tiny, densely packed people in the distance. For group (d) of Figure 8, we selected noisy images with abundant scale-varying ships in the background for comparison. From the sub-figures it can be seen that the proposed method detects more persons than the benchmark; moreover, the original YOLOv5 recognizes a large number of sea persons as earth persons, whereas SR-YOLO recognizes the type of person accurately. Overall, SR-YOLO accurately detects more people and objects than the original model, and the experiments show that it outperforms YOLOv5 on images of tiny, densely packed objects.

V. CONCLUSION
In this paper, we have presented a novel small object detection model, SR-YOLO, which consists of an efficient neck, a newly designed prediction box loss function and a recursive bottleneck transformer, exploring the ability to tackle object occlusion, light variation and intricate backgrounds in UAV images. It suggests a new YOLO paradigm for use in electronic products such as drones, because it can be flexibly integrated into different versions of YOLO; moreover, the proposed method is efficient in both parameter count and accuracy. Our study shows that designing and training lightweight object detector modules for consumer electronics applications is an interesting direction that deserves broader, focused research. This work only makes a first attempt at this exploration and could be further improved in several aspects: how to combine the attention mechanism with feature fusion to capture better detailed and semantic features, and how to further improve real-time performance through efficient fusion strategies while keeping accuracy, for lighter deployment in consumer electronics.
Sajid Javed received the B.Sc. degree in computer science from the University of Hertfordshire, U.K., in 2010, and the combined master's and Ph.D. degree in computer science from Kyungpook National University, Republic of Korea, in 2017. He is a Faculty Member with Khalifa University (KU), UAE. Prior to that, he was a Research Fellow with KU from 2019 to 2021, and with the University of Warwick, U.K., from 2017 to 2018. His research interests include visual object tracking in the wild, multi-object tracking, background-foreground modeling from video sequences, moving object detection from complex scenes, and cancer image analytics, including tissue phenotyping, nucleus detection, and nucleus classification problems. His research themes involve developing deep neural networks, subspace learning models, and graph neural networks.
Moath Alathbah received the Ph.D. degree from Cardiff University, U.K. He is currently an Assistant Professor with King Saud University, Saudi Arabia. His research interests include the development of photoelectronic, integrated electronic active and passive discrete devices; the design, fabrication, and characterization of MMIC, RF, and THz components; smart antennas, microstrip antennas, microwave filters, meta-materials, 5G antennas, and MIMO antennas; and miniaturized multiband/wideband antennas and microwave/millimeter-wave components using micro- and nano-technology.

Fig. 3. Green boxes represent real boxes, and the two red boxes represent predicted boxes of different sizes.

TABLE I RESULTS OF DIFFERENT MODELS ON VISDRONE

TABLE III THE EFFECTS OF THE YOLOV5 COMBINING DIFFERENT MODULES ON THE VISDRONE

TABLE IV RECOGNITION EFFECT OF EACH CLASS ON SR-YOLO

TABLE V PRECISION COMPARISON OF DIOU, CIOU AND SCIOU

TABLE VI COMPARISON OF BOT, RBOT-A, RBOT-B PARAMETER QUANTITIES AND ACCURACY