Aircraft Target Detection in Remote Sensing Images Based on Improved YOLOv5

To address the insufficient detection accuracy and speed for aircraft targets in remote sensing images with complex backgrounds, this paper proposes a new detection method, YOLOv5-Aircraft, based on the YOLOv5 network. The YOLOv5-Aircraft model is improved in three ways: (1) Centering and scaling calibrations are added at the beginning and end of the original batch normalization module to enhance effective features and form a more stable feature distribution, which strengthens the feature extraction ability of the network model. (2) The cross-entropy loss used for confidence in the original loss function is replaced by a loss function based on smoothed Kullback-Leibler divergence. (3) To reduce information loss, the CSandGlass module is designed to replace the residual module in the backbone feature extraction network of YOLOv5. Meanwhile, low-resolution feature layers are eliminated to reduce semantic loss. Experimental results demonstrate that the YOLOv5-Aircraft model improves the accuracy and speed of aircraft target detection in remote sensing images while converging more easily.


I. INTRODUCTION
With the continuous development of satellite remote sensing technology, the amount of information in high-resolution remote sensing images has increased sharply, and the detail they contain is becoming more abundant. Sensitive targets such as ships, tanks, airplanes and ports are now clearly visible to the naked eye, and methods for detecting them have become a research hot spot. Aircraft play an irreplaceable role in both the civilian and military fields. Therefore, the detection of aircraft targets in remote sensing images is of great significance.
However, the detection of aircraft targets in remote sensing images remains a challenging problem because it is susceptible to interference from external factors such as weather, light and shadows. Besides, when the images contain small targets with high exposure and complex backgrounds, aircraft detection becomes more difficult.
Many solutions have been proposed to solve the above problems of target detection [1]. Traditional methods such as template matching are fast, simple and easy to implement, but they place high requirements on the target state and target size and perform badly in complex backgrounds. Machine learning methods are designed to be flexible and highly targeted, but they adapt poorly to new conditions and have poor robustness [2]. In recent years, deep learning methods have developed rapidly. Many target detection algorithms based on CNNs (Convolutional Neural Networks) have been proposed and applied to target detection in remote sensing images [3], [4]. At present, target detection methods can be classified into two main types: two-stage methods and one-stage methods [5]. The two-stage method is a deep convolutional network based on candidate regions. It first generates candidate blocks that may contain the detection target, and then classifies and refines the candidate blocks to obtain the detection frame. Common algorithms include R-CNN (Region CNN) [6], Fast R-CNN (Fast Region-Based CNN) [7] and Faster R-CNN (Faster Region-Based CNN) [8]. These methods have high detection accuracy but low speed. The one-stage method is a deep convolutional network based on regression, which performs end-to-end target detection; examples include SSD (Single Shot MultiBox Detector) [9]-[11] and the YOLO series [12]-[15]. These methods have a faster detection speed and can meet real-time requirements.
The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.
More specifically, scholars have done a lot of research on target detection in remote sensing images. Reference [16] used the k-means algorithm to cluster the data set and drew on the idea of the DenseNet network to improve the YOLOv3 network for detecting aircraft targets in remote sensing images, which greatly improved the detection accuracy. Reference [17] introduced the spatial pyramid pooling structure, a transition module and a residual network to improve the YOLOv3 network, and the comprehensive performance indicators for detecting ship targets in remote sensing images were greatly improved. Reference [18] used the PIIFD descriptor to handle the transformation between the background and the target across different images, and verified that it performed better in remote sensing image target detection in geospatial environments. Reference [19] strengthened the CSP feature extraction network of the YOLOv4 network, replaced the original activation function with the Mish function and added a pyramid pooling module to reduce scale sensitivity and improve the detection accuracy and recall rate. Reference [20] used a multi-scale fusion method to solve the problem of transmitting the semantic information of small targets in a fully convolutional neural network. In summary, deep learning methods have high application value in remote sensing image target detection. Therefore, we tested YOLOv3, YOLOv4 and YOLOv5 on aircraft target detection in remote sensing images. The experimental results show that the detection accuracy is high, but the detection speed is too low to meet the requirements of real-time detection. For images with complex backgrounds, the complexity of the network structure increases the difficulty of training and reduces the detection speed. Meanwhile, overfitting is prone to occur when the amount of data is small, and a network structure that is too simple cannot effectively describe the features of the target, which decreases the detection accuracy. In small target detection, traditional convolutional layers usually fail to be both accurate and real-time, because it is difficult to extract the characteristics of small targets with either a simple network or a complex one.
This paper presents a network model, YOLOv5-Aircraft, based on an improved YOLOv5, to enhance the detection accuracy and detection speed for aircraft targets in remote sensing images. The rest of this paper is arranged as follows. Section 2 describes the YOLOv5 target detection model. Section 3 explains in detail how the YOLOv5 model is improved into YOLOv5-Aircraft. Section 4 presents the experimental analysis. Finally, Section 5 gives the research conclusions.

II. INTRODUCTION OF YOLOv5 DETECTION NETWORK
YOLOv5, proposed by Ultralytics LLC, is an improved version based on YOLOv4. It is a one-stage detection network that performs well in terms of both accuracy and detection speed [21]. Drawing on the advantages of previous versions and of other networks, YOLOv5 overcomes the weakness of earlier YOLO detection algorithms, which were fast but not highly accurate. YOLOv5 improves both detection accuracy and real-time performance, so it meets the needs of real-time image detection while having a smaller structure. Therefore, this paper uses YOLOv5 as the detection model. Its network model is divided into four parts, namely Input, Backbone, Neck and Prediction, and its network structure is shown in Figure 1 [22].
FIGURE 1. The main modules of YOLOv5 network.
VOLUME 10, 2022
Input includes three parts: mosaic data enhancement, adaptive anchor frame calculation and adaptive image scaling. The input of YOLOv5 adopts the same mosaic data enhancement method as YOLOv4: four images are spliced together using random cropping, random scaling and random arrangement, which enriches the detection data set, improves the robustness of the network, reduces the GPU computation, and increases the general applicability of the network. Adaptive anchor frame calculation sets the initial anchor frames for different data sets, outputs the prediction frame on the basis of the initial anchor frame, and then compares it with the ground-truth frame. After calculating the gap, it back-propagates to update the network parameters iteratively. The anchor frame parameters are [116,90,156,198,373,326], [30,61,62,45,59,119], [10,13,16,30,33,23]. Adaptive image scaling scales the image to a uniform size and is implemented in the data preprocessing stage.
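The adaptive image scaling step can be illustrated with a minimal sketch. It assumes aspect-preserving resizing followed by symmetric padding to a square 608 × 608 canvas; the function name `letterbox_params` and the padding policy are illustrative assumptions and are not specified in the text.

```python
def letterbox_params(width, height, target=608):
    """Return the resized size and padding needed to fit an image into a
    target x target canvas without changing its aspect ratio.

    Assumption: the padding is split symmetrically between both sides.
    """
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_w = target - new_w  # total horizontal padding in pixels
    pad_h = target - new_h  # total vertical padding in pixels
    left, top = pad_w // 2, pad_h // 2
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        # Padding order: left, top, right, bottom.
        "pad": (left, top, pad_w - left, pad_h - top),
    }


if __name__ == "__main__":
    # Hypothetical 1216 x 912 source image.
    print(letterbox_params(1216, 912))
```

Under these assumptions, bounding box coordinates would be multiplied by `scale` and shifted by the left and top padding so that the labels stay aligned with the scaled image.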
Backbone includes the Focus structure and the CSPNet (cross stage partial network) structure. Focus slices the 608 × 608 × 3 image to get a 304 × 304 × 12 feature map. Then, after convolution with 32 convolution kernels, a 304 × 304 × 32 feature map is obtained; the process is shown in Figure 2. Neck uses the FPN (feature pyramid network) and PAN (path aggregation network) structures, as shown in Figure 3. FPN transfers and fuses high-level feature information through top-down upsampling to convey strong semantic features. PAN is a bottom-up feature pyramid that conveys strong positioning features. The two are used together to enhance the feature fusion ability of the network.
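The Focus slicing operation can be sketched in plain Python as follows. The sketch works on nested lists rather than tensors, and the order in which the four pixels of each 2 × 2 block are concatenated is an assumption made for illustration; only the change of shape from H × W × C to H/2 × W/2 × 4C is taken from the text.

```python
def focus_slice(image):
    """Rearrange an H x W x C image (nested lists) into H/2 x W/2 x 4C.

    Every 2 x 2 spatial neighbourhood is moved into the channel axis, so no
    pixel value is discarded. Assumption: H and W are even.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            # Concatenate the four pixels of the 2 x 2 block along channels.
            row.append(
                image[y][x]
                + image[y + 1][x]
                + image[y][x + 1]
                + image[y + 1][x + 1]
            )
        out.append(row)
    return out


def shape(tensor):
    """Return (height, width, channels) of a nested-list image."""
    return len(tensor), len(tensor[0]), len(tensor[0][0])


if __name__ == "__main__":
    img = [[[0, 0, 0] for _ in range(608)] for _ in range(608)]
    print(shape(focus_slice(img)))  # (304, 304, 12)
```

Because the slicing only rearranges values, the subsequent convolution with 32 kernels sees every original pixel while operating on a feature map with half the spatial resolution.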
Prediction includes the bounding box loss function and NMS (non-maximum suppression). YOLOv5 uses the GIoU loss as the bounding box loss function, which effectively handles the case where the predicted and ground-truth boxes do not overlap, and improves the speed and accuracy of prediction box regression. In the prediction stage, weighted NMS is used to enhance the ability to recognize multiple objects and occluded objects, and to obtain the optimal object detection frame.
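A minimal sketch of the GIoU loss for two axis-aligned boxes is given below, using the standard definition GIoU = IoU − (|C| − |A ∪ B|) / |C|, where C is the smallest box enclosing both inputs. The box format (x1, y1, x2, y2) and the function names are illustrative assumptions, and boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (hull - union) / hull


def giou_loss(pred, target):
    """Loss is 0 for a perfect match and grows as the boxes move apart."""
    return 1.0 - giou(pred, target)


if __name__ == "__main__":
    print(giou_loss((0, 0, 1, 1), (0, 0, 1, 1)))
    print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```

For two disjoint boxes the plain IoU is 0 regardless of their distance, whereas the GIoU term still varies with the size of the enclosing box, which is why the loss remains informative when the boxes do not overlap.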

III. IMPROVEMENT OF YOLOv5
A. BATCH NORMALIZATION IMPROVEMENTS
Batch normalization (BN) has become a default component for the stable training of modern neural networks. In BN, centering and scaling operations, together with mean and variance statistics, are used for feature normalization over the batch dimension. The batch dependence of BN gives the network stable training and better representation. However, BN inevitably ignores the representation differences between instances. To perform feature calibration in BN, centering and scaling calibrations are added at the beginning and end of the original normalization layer of BN, respectively [23]. Given an input feature $X \in \mathbb{R}^{N \times C \times H \times W}$, where $N$, $C$, $H$ and $W$ are the batch size, number of channels, height and width of the input feature, respectively, the centering calibration of features is written as follows:
$$X_{cm} = X + w_m \odot K_m \tag{1}$$
where $w_m \in \mathbb{R}^{1 \times C \times 1 \times 1}$ is a learnable weight vector whose value changes with the network layer, as shown in Figure 4(a). Its value is close to 0 in most layers and its absolute value increases with depth, because deeper layers carry more instance-specific features. $K_m$ is the statistic of the instance feature $X$, $X_{cm}$ is the feature after centering calibration, and $\odot$ is the dot product operator that broadcasts two features to the same shape and then performs the dot product. The centered features with the centering calibration can then be written as:
$$X_m = X_{cm} - E(X_{cm}) \tag{2}$$
where $E(X_{cm})$ is the mean of $X_{cm}$. By scaling $X_m$ as in BN, we obtain:
$$X_s = \frac{X_m}{\sqrt{\mathrm{Var}(X_{cm}) + \varepsilon}} \tag{3}$$
where $\mathrm{Var}(X_{cm})$ is the variance of $X_{cm}$, and $\varepsilon$ is used to avoid zero variance. Then, the scaling calibration operation is added to the original scaling operation:
$$X_{cs} = X_s \cdot R(w_v \odot K_s + w_b) \tag{4}$$
where $w_v, w_b \in \mathbb{R}^{1 \times C \times 1 \times 1}$ are learnable weight vectors, as shown in Figure 4(b). Similar to $w_m$, their values tend to 0 in most layers and their absolute values increase with depth. $R()$ is the restricted function, which can be defined in multiple forms. In this work, we use the Tanh function to suppress extreme values. Similar to $K_m$, $K_s$ is the statistic of the instance feature $X_s$ and can be defined in multiple ways. The restricted function $R()$, together with $w_v$ and $w_b$ in Eqn. (4), suppresses out-of-distribution features, making the feature distribution more stable. Finally, the learned scale factor $\gamma$ and bias factor $\beta$ are applied as a linear transformation to obtain the final representative batch normalization result $Y$. The affine transformation can be written as follows:
$$Y = \gamma X_{cs} + \beta \tag{5}$$
To utilize the optimization of batch normalization in existing deep learning frameworks, we add the centering and scaling calibrations at the beginning and end of the original normalization layer of batch normalization, respectively, which enhances the effective features, forms a more stable feature distribution, and strengthens the feature extraction ability of the network model. We extract the feature maps in the network, and the result is shown in Figure 5. The feature map on the far right, obtained after the centering and scaling calibration, is more salient than the original output feature map in the middle.
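The calibration steps described above can be sketched for a single channel in plain Python. This is an illustrative sketch under stated assumptions, not the training implementation: $K_m$ and $K_s$ are taken to be per-instance means (global average pooling), the weights are scalars because only one channel is processed, and the default value `w_b=1.0` is an arbitrary choice for demonstration.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def representative_bn_1ch(batch, w_m=0.0, w_v=0.0, w_b=1.0,
                          gamma=1.0, beta=0.0, eps=1e-5):
    """Calibrated batch normalization for ONE channel.

    batch: list of N instances, each a flat list of H*W activations.
    Assumptions: K_m and K_s are per-instance means, R() is tanh.
    """
    # Centering calibration: X_cm = X + w_m * K_m
    x_cm = []
    for inst in batch:
        k_m = _mean(inst)
        x_cm.append([v + w_m * k_m for v in inst])

    # Batch statistics over all N * H * W values of the channel.
    flat = [v for inst in x_cm for v in inst]
    mu = _mean(flat)
    var = _mean([(v - mu) ** 2 for v in flat])

    # Centering and scaling: X_s = (X_cm - E(X_cm)) / sqrt(Var(X_cm) + eps)
    x_s = [[(v - mu) / math.sqrt(var + eps) for v in inst] for inst in x_cm]

    # Scaling calibration: X_cs = X_s * R(w_v * K_s + w_b)
    x_cs = []
    for inst in x_s:
        k_s = _mean(inst)
        factor = math.tanh(w_v * k_s + w_b)
        x_cs.append([v * factor for v in inst])

    # Affine transform: Y = gamma * X_cs + beta
    return [[gamma * v + beta for v in inst] for inst in x_cs]


if __name__ == "__main__":
    demo = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    print(representative_bn_1ch(demo, w_m=0.1, w_v=0.1))
```

With `w_m = 0` and `w_v = 0` the sketch reduces to ordinary batch normalization multiplied by the constant `tanh(w_b)`, which makes it easy to see that the calibration terms are the only source of instance-specific behaviour.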

B. LOSS FUNCTION BASED ON KULLBACK-LEIBLER DIVERGENCE
The mean square error (MSE) is used in the target frame coordinate regression of YOLOv5, and the cross entropy is used as the loss function for confidence and category. However, as the loss function of the target frame, MSE is rather sensitive to the target frame. To further improve convergence stability, this paper replaces the cross-entropy loss with a smoothed Kullback-Leibler (KL) divergence loss when designing the loss function for confidence. KL divergence is also called relative entropy [24]. For two probability distributions $P$ and $Q$ of the same continuous variable, KL divergence is defined as:
$$D_{KL}(P \,\|\, Q) = \int P(x) \ln \frac{P(x)}{Q(x)}\, dx \tag{6}$$
In this paper, $\hat{\varphi}$ denotes the parameters that minimize the KL divergence between the predicted probability distribution and the real label distribution over $n$ input samples:
$$\hat{\varphi} = \arg\min_{\varphi} \frac{1}{n} \sum D_{KL}\big(Q_D(x) \,\|\, P_\varphi(x)\big) \tag{7}$$
where $Q_D(x)$ is the probability distribution of the real label coordinates, $P_\varphi(x)$ is the probability distribution of the predicted coordinates, and both are defined as Gaussian distribution functions; $P_\varphi(x)$ has mean $x_e$ and standard deviation $\sigma$. Therefore, the bounding box regression loss function $L_{KL}$ can be written as:
$$L_{KL} = D_{KL}\big(Q_D(x) \,\|\, P_\varphi(x)\big) \tag{8}$$
Then, according to the properties of the Gaussian distribution function, the derivation is written as follows:
$$L_{KL} = \int Q_D(x) \ln Q_D(x)\, dx - \int Q_D(x) \ln P_\varphi(x)\, dx \tag{9}$$
The definition of the Dirac delta function is shown in (10). Since the Gaussian distribution approximates the Dirac delta function when the standard deviation is close to 0, equation (9) is evaluated using the sifting property of the Dirac delta function in equation (11), and equation (12) is finally obtained:
$$\delta(x) = 0 \ (x \neq 0), \qquad \int_{-\infty}^{+\infty} \delta(x)\, dx = 1 \tag{10}$$
$$\int_{-\infty}^{+\infty} f(x)\, \delta(x - x_g)\, dx = f(x_g) \tag{11}$$
$$L_{KL} = \frac{(x_g - x_e)^2}{2\sigma^2} + \frac{1}{2}\ln \sigma^2 + \frac{1}{2}\ln 2\pi - H\big(Q_D(x)\big) \tag{12}$$
where $x_e$ is the predicted coordinate and $H(Q_D(x))$ is the entropy of the label distribution. Since terms that do not depend on the parameters have no effect on the derivatives, they can be removed as follows:
$$L_{KL} \propto \frac{(x_g - x_e)^2}{2\sigma^2} + \frac{1}{2}\ln \sigma^2 \tag{13}$$
where $x_g$, the real label coordinate, is a constant term.
Because $\sigma$ appears in the denominator of equation (13), an unsuitable initial value of $\sigma$ easily causes gradient explosion in the initial stage of training, which prevents the model from converging. In addition, the argument of $\ln x$ is restricted to positive values. Therefore, in the prediction stage of model training, this paper sets the relationship between the variables $\alpha$ and $\sigma$ as shown in equation (14), and then substitutes equation (14) into equation (13) to obtain the loss function in equation (15):
$$\alpha = \ln \sigma^2 \tag{14}$$
$$L_{KL} \propto \frac{e^{-\alpha}}{2}(x_g - x_e)^2 + \frac{1}{2}\alpha \tag{15}$$
In order to further enhance the robustness of the model, the KL divergence loss function is smoothed. When $|x_g - x_e| > 1$, the regression loss function of the model's bounding box is:
$$L_{KL} = e^{-\alpha}\left(|x_g - x_e| - \frac{1}{2}\right) + \frac{1}{2}\alpha \tag{16}$$
During model training, the smoothed loss function does not change abruptly on noisy samples, which reduces interference during back propagation and makes the convergence of the model more stable.
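The smoothed KL loss for a single coordinate can be sketched as follows. The sketch assumes the commonly used form of this loss, in which the network predicts $\alpha = \ln \sigma^2$ and the loss is quadratic for $|x_g - x_e| \le 1$ and linear otherwise; the function name and the scalar interface are illustrative assumptions.

```python
import math


def smoothed_kl_loss(x_g, x_e, alpha):
    """Smoothed KL regression loss for one coordinate.

    x_g: ground-truth coordinate, x_e: predicted coordinate,
    alpha: predicted log-variance, alpha = ln(sigma^2).
    """
    diff = abs(x_g - x_e)
    if diff <= 1.0:
        # Quadratic branch: exp(-alpha) / 2 * (x_g - x_e)^2 + alpha / 2
        return 0.5 * math.exp(-alpha) * diff ** 2 + 0.5 * alpha
    # Linear branch: limits the influence of samples with large errors.
    return math.exp(-alpha) * (diff - 0.5) + 0.5 * alpha


if __name__ == "__main__":
    for error in (0.2, 1.0, 3.0):
        print(error, smoothed_kl_loss(error, 0.0, alpha=0.0))
```

Predicting $\alpha$ instead of $\sigma$ keeps the loss well defined for any real-valued network output, and the two branches meet at $|x_g - x_e| = 1$, so the loss has no jump where the smoothing takes over.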

C. IMPROVEMENT OF NETWORK STRUCTURE
The CSPNet structure in YOLOv5 divides the feature map of the base layer into two parts and then merges them through a cross-stage hierarchical structure, so that the network obtains richer gradient combinations, but this is also more likely to cause information loss and gradient confusion. Therefore, this paper draws on the ideas of MobileNeXt [25] and uses the hourglass-like module CSandGlass to replace the Res unit module in the YOLOv5 network. The structure of the CSandGlass module is shown in Figure 6. Unlike the bottleneck structure with the depthwise convolution in the middle, this paper moves the 3 × 3 depthwise convolution layers (Dwise) to both ends of the residual path, where the representation is high-dimensional, and places two CBL modules, the basic component of YOLOv5, in the middle. Two depthwise convolutions can encode more spatial information and let more gradients propagate across multiple layers, reducing information loss. Figure 7 shows a before-and-after comparison for the CSandGlass module. The two pictures on the left show the results of using 6 consecutive convolutions without the CSandGlass module. The edge features of the aircraft are not well extracted, and the information is severely lost. The two pictures on the right are the improved feature extraction results using CSandGlass. The edge feature information of the building is better extracted, and the background information and feature information are also more distinct.
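To give a feel for why depthwise convolutions can be placed on the high-dimensional ends of the block at low cost, the following sketch counts convolution weights. The layouts are assumptions made for illustration: the Res unit is taken as a 1 × 1 convolution followed by a 3 × 3 convolution, the two middle CBL modules of the sandglass block are taken as 1 × 1 convolutions with a hypothetical channel reduction ratio of 2, and bias and normalization parameters are ignored.

```python
def conv_params(c_in, c_out, kernel, depthwise=False):
    """Weight count of a convolution, ignoring bias and normalization."""
    if depthwise:
        return c_in * kernel * kernel  # one k x k filter per channel
    return c_in * c_out * kernel * kernel


def residual_unit_params(channels):
    # Assumed Res unit: 1x1 convolution followed by a 3x3 convolution.
    return conv_params(channels, channels, 1) + conv_params(channels, channels, 3)


def sandglass_params(channels, reduction=2):
    # Assumed layout: depthwise 3x3 -> 1x1 reduce -> 1x1 expand -> depthwise 3x3.
    hidden = channels // reduction
    return (
        2 * conv_params(channels, channels, 3, depthwise=True)
        + conv_params(channels, hidden, 1)
        + conv_params(hidden, channels, 1)
    )


if __name__ == "__main__":
    print(residual_unit_params(128))  # 163840
    print(sandglass_params(128))      # 18688
```

The numbers depend entirely on the assumed layouts and are not measurements of the model in this paper; they only show that a depthwise 3 × 3 layer adds a number of weights proportional to the channel count rather than to its square.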
In the original version of YOLOv5, the number of features of the fully connected layer behind the convolution layer is fixed, so the size of the input image is fixed at 608 × 608, and the sizes of the feature layers are 19 × 19, 38 × 38 and 76 × 76, respectively. The smaller the feature layer is, the larger the receptive field of its neurons, which means that the semantic level is richer but local and detail features are lost. Conversely, when the convolutional neural network is shallow, the receptive field is smaller, and the neurons in the feature map tend to capture local and detailed information [20]. In order to reduce semantic loss, the low-resolution feature layer is removed from the network. This not only reduces the semantic loss, but also reduces the number of network parameters. Figure 8 shows the improved network structure of YOLOv5, where CSG is the CSandGlass module and RBN is the improved BN module.
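The relation between the input size and the feature layer sizes can be checked with a short sketch. The downsampling strides of 32, 16 and 8 are inferred from the stated sizes (608 / 19 = 32, 608 / 38 = 16, 608 / 76 = 8), and the assumption that the removed low-resolution layer is the 19 × 19 one is made only for this illustration.

```python
def feature_map_sizes(input_size=608, strides=(32, 16, 8)):
    """Side length of each detection feature map for a square input."""
    return [input_size // stride for stride in strides]


def grid_cells(sizes):
    """Total number of grid cells over all detection feature maps."""
    return sum(size * size for size in sizes)


if __name__ == "__main__":
    original = feature_map_sizes()
    reduced = feature_map_sizes(strides=(16, 8))
    print(original)  # [19, 38, 76]
    print(reduced)   # [38, 76]
    print(grid_cells(original), grid_cells(reduced))
```

Each cell of the 19 × 19 layer summarizes a 32 × 32 pixel region of the input, which illustrates why that layer carries the coarsest spatial detail.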

A. EXPERIMENTAL ENVIRONMENT
In this paper, the deep learning platform is built in OpenCV. The test environment is as follows: NVIDIA Tesla V100 with 16 GB of GPU memory, CUDA 10.1, cuDNN 7.6.5, and Python 3.8 as the programming language.

B. EVALUATING INDICATOR
In the field of object detection, recall, precision and mAP (mean Average Precision) are usually used to evaluate the performance of an object detection algorithm. The recall rate describes the proportion of actual targets that are detected [26]. It is calculated as follows:
$$R = \frac{TP}{TP + FN}$$
where $R$ is the recall rate, $TP$ is the number of positive samples that are correctly detected, and $FN$ is the number of positive samples that are missed. The precision rate is calculated as follows:
$$P = \frac{TP}{TP + FP}$$
where $P$ is the precision rate and $FP$ is the number of negative samples predicted as positive, that is, false detections. However, in general, it is difficult to keep both the recall rate and the precision rate at a high level. Therefore, a parameter is needed to integrate these two quantities. The mAP is used to measure the performance of the detection network. It is suitable for single-label and multi-label image classification. The equation can be written as:
$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \sum_{k=1}^{N} P(k)\, \Delta R(k)$$
where $N$ is the number of samples in the test set, $P(k)$ is the precision rate when $k$ samples are recognized, $\Delta R(k)$ is the change in the recall rate when the number of detected samples changes from $k - 1$ to $k$, and $C$ is the number of categories in the multi-class detection task.
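The evaluation indicators can be sketched in plain Python as follows. The sketch assumes that each detection has already been matched to the ground truth and marked as a true or false positive; the matching rule (for example an IoU threshold) is not covered, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_positives):
    """AP as the sum over k of P(k) times the change in recall at k.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_positives: number of ground-truth targets of this class.
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    ap = 0.0
    previous_recall = 0.0
    for k, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        precision = true_positives / k
        recall = true_positives / num_positives
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    detections = [(0.9, True), (0.8, False), (0.7, True)]
    print(precision_recall(tp=8, fp=2, fn=2))
    print(average_precision(detections, num_positives=2))
```

For a single-class task such as aircraft detection, $C = 1$ and the mAP equals the AP of that class.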

V. DATA PREPROCESSING
A. THE SOURCE OF DATA SET
The remote sensing images studied in this paper are from Google Earth, with 78 images in total [27]-[29]. These are remote sensing images of scenes with aircraft, including non-target images, single-target images and multi-target images. Figure 9 shows several typical remote sensing images in the data set.

B. CONSTRUCTION OF DATA SET
The original pictures are relatively large. Using them at their original size as the training data set would lead to too many parameters. Therefore, the original aerial pictures are reduced to 608 × 608 pixels by pixel transformation. In addition, the original images are rotated, cropped and contrast-adjusted, so that the remote sensing images appear in different forms and scales, which helps to avoid overfitting and thereby improves the generalization ability of the trained network [14]. Figure 10 shows pictures in different forms after preprocessing. After preprocessing, 1000 remote sensing pictures are finally obtained. Following the format of the VOC2007 data set, this paper uses LabelImg to mark the bounding boxes of the aircraft targets in these images one by one, and saves them in the XML format required for training [30]. LabelImg is an image annotation tool for deep learning, which is used to annotate the category name and location information of objects in an image.
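A VOC-style annotation of the kind produced by this labeling step can be sketched with the Python standard library. The file name, the label `aircraft` and the box coordinates below are hypothetical values used only for illustration, and the sketch writes just the core fields of the format (file name, image size, object name and bounding box).

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes, depth=3):
    """Build a Pascal VOC style annotation tree.

    boxes: list of (label, xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", depth)):
        ET.SubElement(size, tag).text = str(value)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        ET.SubElement(obj, "difficult").text = "0"
        bndbox = ET.SubElement(obj, "bndbox")
        corners = (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax))
        for tag, value in corners:
            ET.SubElement(bndbox, tag).text = str(value)
    return root


def read_boxes(root):
    """Read (label, xmin, ymin, xmax, ymax) tuples back from the tree."""
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.find("name").text,) + coords)
    return boxes


if __name__ == "__main__":
    # Hypothetical image name and box, for illustration only.
    tree = voc_annotation("sample_0001.jpg", 608, 608, [("aircraft", 10, 20, 110, 90)])
    print(ET.tostring(tree, encoding="unicode"))
    print(read_boxes(tree))
```

Reading the boxes back in this way is a simple check that the annotations survive a round trip before they are passed to training.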

C. RESULTS AND ANALYSIS
In order to compare YOLOv5 with the improved model proposed in this paper, both models are trained on the processed images for the same number of epochs. The YOLOv4 network is also trained under the same conditions for further comparison. For convenience, the improved model is called YOLOv5-Aircraft. First, the loss reduction of the models is compared. Figure 11 shows the loss curves of the three models. The abscissa is the number of epochs, and the ordinate is the loss. The blue, orange and green lines represent the three different models. The loss of YOLOv5-Aircraft decreases faster than that of YOLOv5 and YOLOv4, indicating that YOLOv5-Aircraft converges faster. After convergence, the loss of YOLOv5-Aircraft is closer to 0 and smoother.
In addition to the comparative analysis of YOLOv4, YOLOv5 and YOLOv5-Aircraft, this paper includes the test results of models such as YOLOv3 and Faster R-CNN for a broader comparison, as shown in TABLE 1.
In TABLE 1, FPS (Frames Per Second) is the detection speed, that is, the number of images the algorithm can process per second. The data in TABLE 1 show that, compared with the original YOLOv5 network, YOLOv5-Aircraft increases mAP by 3.74% and detection speed by 6.93 FPS. This indicates that YOLOv5-Aircraft predicts the location of aircraft more accurately and detects them considerably faster. Faster R-CNN adopts a two-stage detection mechanism and fine-tunes the anchor regions twice, but among the compared algorithms its mAP exceeds only that of YOLOv3, and its detection speed is much lower than that of YOLOv3. The improved YOLOv5 model proposed in this paper is further subjected to ablation experiments to verify its effectiveness. Some network modules are replaced, and the results are shown in TABLE 2, where CUT is the operation of removing low-resolution feature layers. The data in the table show that the RBN module, the CSG module and the loss function based on KL divergence all improve the accuracy and speed of detection. Removing the low-resolution feature layers improves the detection speed but reduces the detection accuracy. Because the RBN module strengthens the feature extraction capability of the network, it improves the detection accuracy and speed by 1.21% and 2.29 FPS, respectively. The loss function based on KL divergence reduces noise interference and improves the detection accuracy to a certain extent, but it has no obvious effect on the detection speed. The CSG module reduces the information loss in the gradient descent process, and also improves the accuracy and speed of detection to a certain extent.
The network without the low-resolution feature layer has one fewer feature layer than the network with it and performs fewer operations such as convolution and splicing, which significantly improves the detection speed but also reduces the detection accuracy. This drawback of removing the low-resolution feature layer also shows that the RBN module, the CSG module and the loss function based on KL divergence are necessary to improve the detection accuracy of the model.
In addition, YOLOv5-Aircraft performs well when the background of the image is complex, and the detection results are shown in Figure 12. The comparison of the pictures in the first two rows shows that YOLOv5 misses some targets. The improved YOLOv5 improves the detection accuracy for small aircraft targets, and there is no missed detection. Moreover, as shown in the third row, when the brightness of the picture is increased by sunlight, the original algorithm misses even more targets, whereas the improved YOLOv5 identifies all aircraft targets in the image, indicating that the improved algorithm still has good recognition ability under abnormal lighting conditions.

VI. CONCLUSION
We proposed a convolutional neural network-based aircraft detection algorithm, YOLOv5-Aircraft, to detect and track aircraft targets in remote sensing images under complex backgrounds. Various types of aircraft targets in remote sensing images were evaluated and studied. First, we added centering and scaling calibrations to the batch normalization module to strengthen the feature extraction ability of the network model. Then the cross-entropy loss used for confidence in the original loss function was replaced by a loss function based on smoothed Kullback-Leibler divergence. Finally, the CSandGlass module was designed to replace the residual module to reduce information loss, and the low-resolution feature layer of the backbone network was removed to reduce the loss of local details, which ultimately led to an efficient and accurate aircraft target detector applicable in various complex environments.
The proposed method was evaluated on a remote sensing image data set from Google Earth and on the Vaihingen data set, including 2000 frames and approximately 13,000 aircraft targets. YOLOv5-Aircraft was shown to cope with a variety of challenges, including shadows, lighting variations and partial visibility, and achieved a major improvement in accuracy (85.25%) and speed (48.85 FPS). Therefore, YOLOv5-Aircraft offers a robust aircraft target recognition algorithm for remote sensing images. However, a large number of test experiments show that factors such as light and weather still have a certain influence on the detection results. In subsequent training, it is necessary to collect more image data in complex environments to improve the generalization ability of the YOLOv5-Aircraft model.