Remote Sensing Image Detection Based on YOLOv4 Improvements

Remote sensing image target object detection and recognition are widely used in both military and civil fields. Many models have been proposed for this purpose, but their effectiveness on target object detection in remote sensing images is not ideal, due to the influence of climate conditions, obstacles and confusing objects present in images, image clarity, and the associated problems of small-target and multi-target detection and recognition. Therefore, how to accurately detect target objects in images is an urgent problem to be solved. To this end, a novel model, called YOLOv4_CE, is proposed in this paper, based on the classical YOLOv4 model with three improvements: the backbone feature-extraction network CSPDarknet53 is replaced with a ConvNeXt-S network, the Complete Intersection over Union (CIoU) loss is replaced with the Efficient Intersection over Union (EIoU) loss, and a coordinate attention mechanism is added to YOLOv4, so as to improve its remote sensing image detection capabilities. The results, obtained through experiments conducted on two open data sets, demonstrate that the proposed YOLOv4_CE model outperforms both the original YOLOv4 model and four other state-of-the-art models, namely Faster R-CNN, Gliding Vertex, Oriented R-CNN, and EfficientDet, in terms of the mean average precision (mAP) and F1 score, achieving 95.03% and 0.933, respectively, on the NWPU VHR-10 data set, and 95.89% and 0.937 on the RSOD data set.

The objective of this paper is to propose a novel model, called YOLOv4_CE, based on YOLOv4 improvements, so as to achieve better remote sensing image detection performance. The main contributions of the paper are the following:

1) Replacing the feature extraction backbone (CSPDarknet53) of YOLOv4 with ConvNeXt-S [17], in order to make the model extract features more effectively, thereby lessening the computation of redundant information at the feature layer and reducing the model size;

2) Integrating the coordinate attention (CA) mechanism [18] into YOLOv4, so as to increase the receptive field and allow the model to pay more attention to the important parts of the processed images;

3) Replacing the Complete Intersection over Union (CIoU) loss [19] with the Efficient Intersection over Union (EIoU) loss [20] in the loss function of YOLOv4, so as to achieve faster convergence and improve the regression precision.

A. ATTENTION MECHANISMS
Attention mechanisms were first proposed and used for natural language processing (NLP), in particular for text alignment in machine translation. In the field of computer vision, attention mechanisms are used to improve the performance of the utilized neural networks. The existing attention mechanisms include Squeeze-and-Excitation (SE) [21], the Convolutional Block Attention Module (CBAM) [22], Coordinate Attention (CA) [18], etc. SE addresses the information loss caused by the differing importance of the feature map channels during convolution and pooling, but it ignores the importance of positional information. Considering this shortcoming of SE, CBAM integrates two attention mechanisms, namely channel attention and spatial attention. By reducing the number of channels and using a large-kernel convolution to exploit location information, CBAM can not only reduce the number of parameters and save computing power, but can also be integrated seamlessly into any CNN architecture. However, convolutions can only capture local relations and fail to model the long-range dependencies that are essential for computer vision tasks [18]. CA effectively integrates spatial coordinate information into the generated attention map by embedding positional information into the channel attention, in order to reduce the loss caused by 2D global pooling, and decomposes the channel attention into two parallel 1D feature encodings, resulting in a significant gain for dense prediction tasks.
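To make the channel attention idea concrete, a minimal PyTorch sketch of an SE block is given below, assuming the usual squeeze (global average pooling) and excitation (bottleneck MLP with a sigmoid gate) steps; it is an illustration only, not the implementation used in the cited works.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (illustrative sketch only)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(                         # excitation: per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # reweight the channels
```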

B. MULTI-SCALE FEATURE INTEGRATION
In the field of target object detection, integrating features of different scales is a vital task for improving the performance of distinguishing target objects from the image background.

The resolution of high-level features is low, and the perception of details is poor, but the semantic information is rich.

On the contrary, the resolution of low-level features is high, and the details and location information are rich, but the semantic information is poor. A typical multi-scale feature integration structure is shown in Figure 1, where C_i (i = 2, 3, 4, 5) represents the feature map extracted at the i-th stage of the backbone network.

Many target object detection models have been proposed in trying to achieve better accuracy and efficiency [8]. These models are presented in the following subsections.

YOLO (You Only Look Once) detects the target objects present in an image and their locations on it. For this, instead of repurposing classifiers to perform detection, it frames object detection as a single regression problem to spatially separated bounding boxes and associated class probabilities, which are predicted by a single CNN directly from the entire image in one step. YOLO trains on full images and directly optimizes its performance for object detection.

Among the different YOLO versions, the Darknet-based version 4 (YOLOv4) is the most accurate one, especially if a computer-vision engineer is in pursuit of state-of-the-art results and can perform additional customization of the model [28]. That is why YOLOv4 was selected as the basis of the model proposed in this paper, and as the main YOLO representative in the performed comparison of models.

The YOLOv4 structure is shown in Figure 5. The Complete Intersection over Union (CIoU) loss [19] is used as the loss function to further account for the aspect ratio, overlapping area, and center distance between the prediction frame and the target frame. The CBM module is composed of convolution (Conv), Batch Normalization (BN) [31], and the Mish activation function, whereas the CBL module is composed of Conv, BN, and the Leaky_ReLU [32] activation function. The convolution kernels in front of the Cross-Stage Partial connections (CSP) modules are of size 3 × 3 with a stride of 2, which is equivalent to downsampling [33]. The SPP module applies max pooling with kernel sizes of 1 × 1, 5 × 5, 9 × 9, and 13 × 13 in parallel, and the resulting feature maps are then spliced together (concatenated along the channel dimension) before being output.
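As an illustration of these building blocks, the following is a minimal PyTorch sketch of the CBM, CBL, and SPP modules described above; the kernel sizes and strides shown are the commonly used ones, and the sketch is not the exact YOLOv4 reference implementation.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Conv + BatchNorm + Mish, as used in the YOLOv4 backbone."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CBL(nn.Module):
    """Conv + BatchNorm + Leaky ReLU, as used in the YOLOv4 neck and head."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPP(nn.Module):
    """Spatial Pyramid Pooling: parallel max pooling (5/9/13) plus the identity
    branch (the 1 x 1 case), followed by channel-wise concatenation."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # Output has 4x the input channels (identity + three pooled branches).
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```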

EfficientDet [27] uses EfficientNet [34], pre-trained on the ImageNet data set, as its backbone. The level 3 to 7 feature maps (i.e., P3, P4, P5, P6, and P7) are extracted from the backbone, fed into the BiFPN layer, integrated (from top to bottom), and finally sent to the bounding box prediction network and the category prediction network, as shown in Figure 6.

EfficientDet also proposes a new compound scaling method for target object detection, whereby the backbone, the BiFPN, the classification network, the bounding box prediction network, and the input resolution are all scaled up jointly through a single recombination coefficient φ, as given by (2) and (3).
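Assuming the elided equations (2) and (3) follow the scaling rules published in the original EfficientDet paper [27], they can be restated as the small helper below; the exact constants should be checked against that paper.

```python
def efficientdet_compound_scaling(phi: int) -> dict:
    """Compound scaling rules from the original EfficientDet paper [27],
    parameterized by the recombination coefficient phi; assumed to correspond
    to the elided equations (2) and (3)."""
    return {
        # W_bifpn: rounded to a hardware-friendly channel count in practice
        "bifpn_width": 64 * (1.35 ** phi),
        # D_bifpn: number of BiFPN layers
        "bifpn_depth": 3 + phi,
        # D_box = D_class: depth of the box / class prediction networks
        "box_class_depth": 3 + phi // 3,
        # R_input: input image resolution
        "input_resolution": 512 + 128 * phi,
    }
```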

To improve the remote sensing image detection performance of YOLOv4, the proposed model replaces its backbone with ConvNeXt-S, integrates the CA mechanism, and replaces the CIoU loss with the EIoU loss [20] in the loss function. The resultant model, whose structure is shown in Figure 11, is called YOLOv4_CE.

In the ConvNeXt block, the number of activation functions is reduced, two Batch Normalization (BN) [31] layers are removed, the remaining BN is replaced with Layer Normalization (LN) [39], and a separate 2 × 2 convolution layer with a stride of 2 is used for spatial downsampling. ConvNeXt has different architectures depending on the different stacks of blocks used. In the proposed YOLOv4_CE model, the ConvNeXt-S architecture is utilized, with a (3, 3, 27, 3) block stacking, as shown in Figure 9.
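For illustration, a minimal PyTorch sketch of a ConvNeXt block and of the separate stride-2 downsampling layer is given below; it follows the published ConvNeXt design [17] (the 7 × 7 depthwise convolution, the inverted bottleneck, and the single GELU activation are taken from that paper, not from the text above) and is not the exact code used in YOLOv4_CE.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """A ConvNeXt block following the published design [17]: 7x7 depthwise conv,
    LayerNorm, a pointwise MLP with a single GELU, and a residual connection
    (layer scale and stochastic depth are omitted for brevity)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)             # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # inverted bottleneck: expand
        self.act = nn.GELU()                      # the only activation in the block
        self.pwconv2 = nn.Linear(4 * dim, dim)    # project back

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # to (N, H, W, C) for LN / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)   # back to (N, C, H, W)

def downsample_layer(c_in: int, c_out: int) -> nn.Sequential:
    """Separate 2x2, stride-2 convolution used between ConvNeXt stages."""
    return nn.Sequential(
        nn.GroupNorm(1, c_in),                    # stand-in for the channels-first LayerNorm
        nn.Conv2d(c_in, c_out, kernel_size=2, stride=2),
    )

# ConvNeXt-S stacks the blocks per stage as (3, 3, 27, 3).
```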

The Coordinate Attention (CA) mechanism [18] encodes channel relationships and long-range dependencies with precise positional information, using a simple overall structure, as shown in Figure 10.

Firstly, the input feature map is pooled along the two spatial directions, width and height, by global average pooling. The outputs of the c-th channel at height h and at width w can be expressed as:

$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w).$

Then, CA concatenates the two aggregated feature maps, performs a 1 × 1 convolution $F_1$, and obtains the intermediate feature map $f$ after applying normalization and a non-linear activation function $\delta$:

$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right).$

This is followed by splitting $f$ into $f^h$ and $f^w$, applying a 1 × 1 convolution ($F_h$ and $F_w$, respectively) and the sigmoid activation function $\sigma$ to each, and transforming them into tensors with the same number of channels as the input:

$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right).$

Finally, CA expands the outputs $g^h$ and $g^w$ and uses them as attention weights. The final output is:

$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j).$
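The CA computation described above can be sketched in PyTorch as follows; the reduction ratio and the Hardswish non-linearity are assumptions taken from the original CA design [18], and the sketch is not the exact module used in YOLOv4_CE.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention [18]: directional pooling along H and W, a shared
    1x1 transform, then per-direction attention weights (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)              # reduction ratio: assumption from [18]
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                        # non-linearity: assumption from [18]
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                             # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)         # (N, C, W, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                              # (N, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))          # (N, C, 1, W)
        return x * g_h * g_w                             # broadcast over both spatial axes
```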

The task of target object detection is to recognize and locate target objects, for which a loss function is utilized to make the recognition and localization more accurate.

In YOLOv4, the Complete Intersection over Union (CIoU) loss is used as the loss function, which is formulated in [19] as:

$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v,$

where $b$ and $b^{gt}$ denote the centers of the predicted and target frames, $\rho(\cdot)$ denotes the Euclidean distance, $c$ denotes the diagonal length of the minimum bounding rectangle covering the two frames, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures the consistency of the aspect ratios, and $\alpha = \frac{v}{(1 - IoU) + v}$ is a trade-off parameter.

The Efficient IoU (EIoU) loss allows achieving faster convergence by penalizing the height and width of the target and predicted frames separately, as shown below:

$L_{EIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \frac{\rho^2\left(w, w^{gt}\right)}{C_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{C_h^2},$

where $w$ and $h$ denote the width and height of the predicted frame, $w^{gt}$ and $h^{gt}$ denote the width and height of the target frame, and $C_w$ and $C_h$ denote the width and height of the minimum bounding rectangle covering the target and predicted frames.

As the EIoU loss splits the aspect-ratio loss term into the differences between the widths and between the heights of the predicted and target frames, each normalized by the corresponding side of the minimum bounding box, it accelerates convergence and improves the regression precision. These were the main reasons for adopting the EIoU loss as the loss function of the proposed YOLOv4_CE model.
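A minimal sketch of the EIoU loss for axis-aligned boxes in corner format is given below; it follows the formulation above and is not the exact training code of YOLOv4_CE.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of the EIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4),
    assuming standard corner-format boxes; not the exact training code of YOLOv4_CE."""
    # Intersection over Union
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Center distance normalized by the diagonal of the minimum enclosing box
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    c_w, c_h = ex2 - ex1, ey2 - ey1
    center_term = ((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / (c_w ** 2 + c_h ** 2 + eps)

    # Width and height terms, each normalized by the enclosing box side (the EIoU split)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    wh_term = (w_p - w_t) ** 2 / (c_w ** 2 + eps) + (h_p - h_t) ** 2 / (c_h ** 2 + eps)

    return 1 - iou + center_term + wh_term
```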

Based on precision and recall, the F1 score and the mean average precision (mAP) were used as the main evaluation metrics in the experiments. These metrics are defined as follows:

$F1 = \frac{2 \times precision \times recall}{precision + recall},$

$AP = \int_0^1 p(r)\,dr, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$

where $p(r)$ denotes the precision as a function of the recall $r$, $AP_i$ denotes the average precision for the $i$-th class of objects, and $N$ denotes the number of classes.
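For illustration, these metrics can be computed as sketched below; the monotone-envelope interpolation of the precision-recall curve is an assumption, as the exact AP interpolation scheme is not stated here.

```python
import numpy as np

def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall + 1e-12)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (AP = integral of p(r) dr),
    assuming recall is sorted in ascending order; uses the common
    monotone-envelope interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]                # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class: dict) -> float:
    """mAP = mean of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))
```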

In the conducted experiments, precision-recall curves were first created for each of the compared models and for each class of objects in the corresponding data set, based on the obtained values of recall and precision. Then, these curves were used to calculate the AP of each model for each class of objects, separately for each experiment, based on (18). Finally, in order to compare the overall target object detection performance of the models, the obtained values were averaged, as described below. The other metric, the F1 score, was then computed separately for each model in each of the five experiments, and the corresponding values were averaged to obtain the final F1 score result for each model, as summarized in Tables 8 and 15.

The per-class results are presented in Tables 2-7. Based on these, the averaged values were calculated, confirming that the proposed YOLOv4_CE model outperforms all five state-of-the-art models on this evaluation metric too. More specifically, Faster R-CNN, Gliding Vertex, Oriented R-CNN, EfficientDet, and YOLOv4 are outperformed by 0.200, 0.060, 0.002, 0.089, and 0.041 points, respectively.

The most challenging for target object detection proved to be the images with a complex background (the bridge and basketball court classes) and the images containing densely packed small targets (the vehicle and harbor classes).

The results obtained in the experiments with the second data set are presented in Tables 8-14. Based on these, the averaged mAP values were calculated, as shown in Table 15. The obtained results confirm that the proposed YOLOv4_CE model outperforms, in terms of mAP, all five state-of-the-art models on this data set too. More specifically, Faster R-CNN, Gliding Vertex, Oriented R-CNN, EfficientDet, and YOLOv4 are outperformed to a similar degree as on the other data set, namely by 10.39, 7.28, 4.51, 6.59, and 3.81 points, respectively.

Then, the F1 score values were calculated in each experiment for each model and averaged to produce the final results presented in Table 15.