Vehicle detection in aerial images based on lightweight deep convolutional network

Vehicle detection in aerial images is an interesting and challenging task. Traditional methods are based on sliding-window search and handcrafted features, which limits their representation power and incurs heavy computational costs. Recent research has shown that deep-learning algorithms are widely used in the field of object detection. However, deep-learning algorithms still face many difficulties and challenges in object detection under aerial scenes. Meanwhile, the high computational cost of detection algorithms leads to low detection efficiency. In this study, we build a fast and accurate lightweight detection framework for vehicle detection in aerial scenes. The proposed detection method improves the expressive ability of the detection network and significantly reduces the amount of calculation in the model. Meanwhile, anchor boxes sized according to the object vehicles have been introduced in our model, which also effectively improves detection performance. In addition, we have published a new aerial vehicle image dataset and verified the effectiveness of our method on it. On the Munich dataset and our dataset, our method achieves 85.8% and 91.2% mean average precision (mAP), with detection times of 1.78 and 0.048 s on an Nvidia Titan XP. Our results show that the proposed framework achieves significant improvement over several alternative and state-of-the-art schemes, with higher accuracy and less detection time.


INTRODUCTION
Vehicle object detection in aerial photography scenarios plays an important role in both military and civil fields. This technology can be effectively applied to traffic management and military target strikes. In the civil field, vehicle recognition is one of the important directions of object detection research in aerial imagery, and it is an important part of intelligent transportation systems. It provides technical support for real-time road-condition acquisition, accident monitoring, illegal-parking monitoring and other application scenarios. In the military field, object detection of aerial vehicles can obtain a wealth of battlefield information and enable accurate strikes on specific targets. Object detection in aerial photography requires real-time operation, which places high demands on detection speed and accuracy. On an unmanned aerial vehicle (UAV) embedded platform, the detection model needs to perform real-time detection without relying on a graphics processing unit (GPU) device, which requires reducing the size and the amount of calculation of the model.
Recently, vehicle detection based on aerial images has received extensive attention and achieved a series of good detection results [1][2][3][4][5][6]. However, these detection methods still have obvious shortcomings. First, aerial images are generally very large while the vehicles are very small, and it is genuinely difficult to detect several small vehicle objects over such a large area. Second, vehicles come in a variety of styles and colours, and they are often accompanied by complex background information such as occlusion and shadows.
In previous studies [7][8], various algorithms for vehicle detection in aerial images have been proposed, which have improved robustness and effectiveness over time. The most common algorithms are based on the sliding-window method and apply a filter to all possible locations and scales in the image. Although these traditional target detection algorithms have achieved some good detection results, their disadvantages are also obvious. First, handcrafted features and shallow learning-based features restrict the ability to extract and represent features. Second, the sliding-window method produces many redundant computations, which significantly increases the computational burden.
Deep convolutional neural networks (CNNs) are widely used in semantic segmentation [9][10] and object detection [11][12][13][14][15]. In the field of object detection, methods based on convolutional neural networks have achieved exciting results on natural-scene images. Typical models that work very well in object detection include R-CNN [11], Fast R-CNN [12], Faster R-CNN [13], you only look once (YOLO) [14][15], the single shot detector (SSD) [18] and so on. Running high-quality CNN models under strict constraints on memory and computational budget is a rising interest in computer vision. In recent years, many innovative networks have been proposed, such as PVANet [21], MobileNet [22], ShuffleNet [23] and NASNet [24]. However, these architectures rely heavily on depthwise separable convolution operations, which lack efficient implementations. At the same time, there is little research on combining such efficient backbones with fast object detection models. The SSD [18] is one of the first attempts at using a pyramidal feature hierarchy. Ideally, an SSD-style pyramid would reuse the multiscale feature maps from different layers computed in the forward pass and thus come free of cost. However, to avoid using low-level features, SSD foregoes reusing already-computed layers and instead builds the pyramid starting from high up in the network, adding several new layers. Thus, it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy.
In this study, we propose a fast and accurate vehicle detection framework for large-scale aerial images (see Figure 1). Our method combines a lightweight feature extraction network with the object detection algorithm Faster R-CNN, which is more suitable for the detection of small objects in aerial scenes. Moreover, we set appropriate scales and ratios of the anchor boxes according to the size of object vehicles in aerial images, which further improves the effectiveness of the detection. We evaluate and validate the effectiveness of the proposed method on the Munich dataset [2] and our own collected dataset [45]. The results show that our method achieves state-of-the-art performance in terms of detection accuracy and time consumption. At the same time, the limited annotation data would easily lead to over-fitting of the model, and the large-scale aerial images would increase the detection time. To overcome these problems, we segment each aerial image into blocks and increase the amount of data through rotation, scaling and Gaussian noise.
The main contributions of our work are presented as follows:
1. A fast and effective lightweight vehicle detection model for aerial scenes is established. In this detection framework, a series of methods for small-object detection, such as transition layers with equal channel widths, feature extraction with small convolution kernels and a three-way dense layer for fusing different feature maps, are implemented in the proposed network, which improves the expressive ability of the detection network and significantly reduces the amount of calculation in the model.
2. We combine the proposed lightweight feature extraction model with the object detection algorithm Faster R-CNN and set anchor boxes of appropriate size according to the size of the vehicle objects, so that the framework can better detect small objects in aerial photography scenes.
3. Extensive ablation studies validate the effectiveness and efficiency of the proposed approach. Moreover, we have published our aerial vehicle image dataset with ground truth and verified the effectiveness of our method on it, which further validates the effectiveness and robustness of our model.

RELATED WORK
In this section, we briefly introduce the recent methodologies related to object detection including the method of traditional handcrafted features and deep CNN. In addition, some recent vehicle detection models are introduced as well.

Traditional handcrafted features for object detection
In feature extraction, the Haar wavelet, local binary pattern (LBP), scale-invariant feature transform (SIFT) and histogram of oriented gradients (HOG) are typical handcrafted features for object detection. Viola et al. [26] proposed an integral-image method for Haar wavelet features to increase the speed of object detection; this research was successfully applied to face detection [27]. The LBP is used to extract the texture features of an image [28] and is invariant to rotation and grey-scale changes. SIFT is a local feature description operator [29] that is robust to changes in illumination, noise and viewing angle.
The HOG algorithm uses a sliding window to scan all possible positions and scales in the image [30]. Additionally, the deformable parts model (DPM) algorithm computes HOG gradient histograms and then uses a support vector machine (SVM) to perform matching and classification of the object [31]. As an important object detection algorithm, DPM has advantages in posture analysis and object localisation.
However, these feature extraction methods are manually designed based on the designers' understanding of the specific task, and they usually have few actual parameters and unstable performance. Although traditional object detection methods show good performance on specific detection tasks, their disadvantages are also obvious; therefore, they still cannot meet the requirements of large-scale data at the current stage.

Object detection with the deep CNNs model
There are some typical CNN detection models such as regions with convolutional neural network features (R-CNN) [11], Fast R-CNN [12], Faster R-CNN [13], YOLO [14], YOLOv2 [15], SSD [18], feature pyramid networks (FPN) [19] and RefineDet [20]. R-CNN uses a selective search algorithm to generate object-like regions and extracts deep features for classification with an SVM classifier [11]. Although this algorithm improved detection efficiency, its large number of repetitive operations restricts the performance of the model. Compared with R-CNN, Fast R-CNN has great advantages in detection performance [12]. It designs an ROI-pooling layer structure, which eliminates many repetitive operations in R-CNN and greatly improves the performance of the algorithm. However, Fast R-CNN still requires selective search to generate positive and negative samples, which limits detection efficiency. Faster R-CNN employs a region proposal network (RPN) as an auxiliary network to determine whether there is an object in each candidate box, and then performs classification and localisation [13]. In Faster R-CNN, the CNN of the whole pipeline shares the extracted feature information, which greatly improves computational efficiency. The YOLO [14] algorithm is based on the global information of the image and directly predicts the coordinates and category confidence in each grid cell, which improves detection speed. On the basis of YOLO, YOLOv2 [15] introduces batch normalisation, a high-resolution classifier and convolution with anchor boxes, which makes the detection model better and stronger. The SSD [18] algorithm predicts object regions on feature maps of different convolutional layers, which output discrete multiscale default boxes.
Although these CNN-based detection models have achieved high accuracy, their enormous computational overhead leads to long detection times. Recently, a series of lightweight networks have been developed, including PVANet [21], MobileNet [22], ShuffleNet [23], PeleeNet [25], GoogLeNet [32], Xception [33], ResNeXt [34] and SqueezeNet [35], which have less complex structures and provide comparable or even better results than other CNNs. However, in object detection applications, there are few studies combining such backbone networks with detection algorithms.
PVANet [21] mainly aims to improve the speed of the detection model. In PVANet, the authors propose a lightweight network based on the principle of more layers with fewer channels. In addition, they use concatenated rectified linear units (C.ReLU) and an Inception structure to reduce network redundancy. In PeleeNet [25], the authors combine parallel multi-size kernel convolutions into a two-way dense layer to squeeze the model, and adopt a residual block after the feature extraction stage to improve accuracy. Compared with MobileNet and ShuffleNet on the Pascal VOC dataset, PeleeNet gives more accurate results with a smaller model size. However, since the SSD algorithm uses multiscale hierarchical feature maps for object detection, it tends to be unsuitable for small-target detection, and the model also generates a large number of redundant calculations. Therefore, the combination of the PeleeNet backbone with the SSD detection algorithm is not suitable for vehicle detection in large-scale aerial scenes.

Vehicle detection with the deep CNNs model
The detection model with deep CNN has great advantages in the field of object detection, and it has become an important detection method in this research area. Yuan et al. [36] present a sequential vehicle detection model to deal with serious occlusions in natural scenes; the algorithm can effectively detect and track vehicles in complex interaction scenarios. Wang et al. [37] perform traffic-congestion detection by detecting vehicles on the road; in this algorithm, the authors use low-level texture information and kernel regression to estimate the degree of traffic congestion.
Nassim et al. [38] propose to segment the aerial image into similar regions on the basis of a CNN, and then use a trained SVM classifier to locate and classify the object after determining the candidate regions of vehicles. Cheng et al. [39] propose a rotation-invariant convolutional neural network (RICNN) algorithm for target detection in aerial images. Although the measures mentioned above further improve detection performance, they still significantly increase the overhead of the network. Tuermer et al. [43] suggest an innovative processing chain for vehicle detection that is able to exclude large homogeneous regions of multichannel images without excluding useful information; their results show a high detection rate and good reliability. Audebert et al. [44] present a deep-learning-based segment-before-detect method for the segmentation and detection of several varieties of vehicles in remote-sensing images; this research shows that deep learning is also suitable for object-oriented analysis of aerial images, because object detection can be obtained as a byproduct of accurate semantic segmentation.
Recently, some related studies employ Faster R-CNN for vehicle detection in aerial imagery. In [3], the authors present a CNN-based detection model with two convolutional networks, where the first network generates vehicle-like regions from multi-feature maps, and the generated candidate regions are then fed into the second network for feature extraction and object detection. In [4], the authors improve the accuracy of vehicle detection in aerial images by adding negative sample marks and establishing a hyper region proposal network (HRPN). However, aerial images have their own characteristics, such as large image scale, small objects and complicated background information, which bring many challenges to detection. Therefore, the deep detection models mentioned above cannot be directly used for vehicle detection in aerial scenes, and targeted improvements are necessary according to the characteristics of aerial images.
In this study, we have established a feature extraction network for small targets in aerial scenes, which can better extract the features of small targets. Meanwhile, we set anchor boxes of appropriate size according to the target size in the object detection algorithm Faster R-CNN. Additionally, we combine the feature extraction network with the improved Faster R-CNN, which significantly improves the accuracy and speed of detection.

MATERIAL AND METHODS
The framework of our proposed vehicle detection method is illustrated in Figure 2. Methods based on deep neural networks have high requirements on GPU memory during image processing, especially when processing large images. Meanwhile, the number of aerial image datasets is limited, which easily leads to over-fitting. Considering these difficulties and challenges, we preprocess and augment the dataset according to [4] and [5]. In the training phase, we crop the original large images into blocks and increase the amount of data through rotation, scaling and Gaussian noise. In the testing stage, the test results are obtained with the trained model, and the detection results of the image blocks are stitched together to reconstruct the original image.

The blocks in the proposed network
Our proposed detection network is composed of stem and stage blocks. The highlights of this network are transition feature maps with equal channel width, small convolution kernels for feature extraction of small objects and a three-way layer to fuse different feature maps.

Stem block
Motivated by Inception-v4 [41] and PeleeNet [25], we design a cost-efficient stem block before the first layer. In the stem block, one branch is stacked with a 1 × 1 and a 3 × 3 convolution, and the other branch is a 2 × 2 max-pooling layer (see Figure 3(a)). We find that adding this simple stem structure evidently improves detection performance in our experiments. At present, mainstream lightweight networks usually adopt depthwise separable convolutions, in which pointwise convolutions account for a large part of the computation. However, the memory consumption is still high when the input and output have different numbers of channels. Recently, ShuffleNet V2 [40] showed that equal channel width minimises the memory access cost (MAC). In the convolution operation, we assume that the convolution kernel is n × n and that the numbers of input and output channels are c_1 and c_2. FLOPs, the number of floating-point operations, is an important standard for measuring computational cost. On a feature map of spatial size h × w, the FLOPs of the convolution are calculated as shown in Equation (1):

B = h w n^2 c_1 c_2    (1)

If the storage space of the cache is large enough to store all the feature maps and parameters, the MAC is calculated as shown in Equation (2):

MAC = h w (c_1 + c_2) + n^2 c_1 c_2    (2)

In Equation (2), the h w c_1 elements of the input feature map and the n^2 c_1 c_2 convolution kernel weights are loaded into the cache, and the h w c_2 elements of the result are then written back to memory.
The MAC expression can be bounded from below by the mean-value inequality c_1 + c_2 >= 2 sqrt(c_1 c_2). The simplified MAC inequality is shown in Equation (3):

MAC >= 2 h w sqrt(c_1 c_2) + n^2 c_1 c_2 = (2/n) sqrt(h w B) + B / (h w)    (3)

According to this result, for a fixed computational budget B, the MAC attains its minimum value when c_1 and c_2 are equal.
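As a numerical illustration of this cost analysis (the formulas follow the ShuffleNet V2-style accounting described above; this is our own sketch, not the authors' code), the following shows that with FLOPs held fixed, equal input and output channel widths give a lower memory access cost:

```python
# Sketch: FLOPs and memory-access cost (MAC) of a single convolution
# on an h x w feature map with an n x n kernel, c1 input channels and
# c2 output channels.

def conv_flops(h, w, n, c1, c2):
    # B = h * w * n^2 * c1 * c2 multiply-accumulate operations
    return h * w * n * n * c1 * c2

def conv_mac(h, w, n, c1, c2):
    # input map + output map loaded/stored once, plus kernel weights
    return h * w * (c1 + c2) + n * n * c1 * c2

# With c1 * c2 constant (so FLOPs are identical), the balanced layer
# has the smaller MAC:
h, w, n = 56, 56, 3
balanced = conv_mac(h, w, n, 32, 32)   # c1 = c2 = 32
skewed = conv_mac(h, w, n, 8, 128)     # c1 != c2, same product
print(balanced < skewed)  # True
```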
The stem block reduces the loss of information from the original input image, which improves the ability of feature expression without adding much computational cost. Our experimental results show that the addition of this structure effectively improves detection performance. Unlike PeleeNet, we use the same number of feature channels throughout the stem block (see Figure 3(a)): only feature maps with 16 channels are used for feature extraction and propagation. Our experiments show that this design reduces the computational cost of the stem block by 33.2%.

Stage block
Motivated by GoogLeNet [32] and PeleeNet [25], we apply a three-way dense layer, named the stage block, to fuse feature maps of different scales (see Figure 3(b)). In PeleeNet [25], the stage block is used to extract features of objects of various sizes: one way of the block uses a 1 × 1 convolution and a 3 × 3 convolution to extract features of small objects, while the other way uses two stacked 3 × 3 convolutions to capture large objects. However, our detection task mainly concerns small objects in aerial scenes, so we need more refined feature extraction for small objects. The structure of our stage block (see Figure 3(b)) fully exploits the advantages of the small 1 × 1 convolution kernel for feature extraction of small objects. We concatenate these two layers and the previous layer for feature fusion. Meanwhile, we add a 1 × 1 convolution on the last layer of each stage block, which enhances the feature representation of small objects. Moreover, we keep the design principle of equal channels in the stage block, which accelerates the extraction and propagation of features. Recent research (PeleeNet [25], NIN [42], GoogLeNet [32], InceptionNet [41]) shows that the 1 × 1 convolution kernel plays a cross-channel aggregation role and can reduce the parameters and dimensions of the feature maps. In other words, a 1 × 1 convolution is a linear combination of multiple feature maps, so it can change the number of channels of the feature maps. When the convolutional layer is followed by an activation layer, the 1 × 1 convolution adds a non-linear activation to the learned representation of the previous layer, which improves the expressive ability of the network. The 1 × 1 convolution kernel can also perform fine feature extraction over a small range, which allows the features of small objects to be extracted better.
Therefore, the advantages of the 1 × 1 convolution kernel for small-object detection are fully utilised in our model, which improves detection accuracy and reduces detection time.
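The cross-channel aggregation role of the 1 × 1 convolution can be made concrete with a small numpy sketch (our own illustration, not the authors' code): applying a 1 × 1 kernel is a per-pixel linear combination of the input channels, changing the channel count while leaving the spatial layout untouched.

```python
import numpy as np

# A 1x1 convolution mixes channels at each pixel independently,
# so it is equivalent to one matrix multiply shared by all pixels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))   # h x w x c1 feature map
w = rng.standard_normal((16, 4))      # 1x1 kernel: mixes c1=16 -> c2=4

y = x.reshape(-1, 16) @ w             # same channel mixing at every pixel
y = y.reshape(8, 8, 4)                # h x w x c2: spatial size unchanged

print(y.shape)  # (8, 8, 4)
```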

Anchor box
The anchor box has a basic size parameter, from which boxes of different sizes are generated according to area scales and aspect ratios. In Faster R-CNN, the anchor boxes have three areas (128², 256², 512²) with three aspect ratios (1:1, 1:2, 2:1), giving a total of nine anchor shapes. On the obtained feature map, a 3 × 3 sliding window determines the corresponding centre point in the original image. At each window position, nine regions in the original image can be derived from the nine anchors, and the size and coordinates of each region are then obtained by calculation. However, the anchor boxes generated in Faster R-CNN are designed for object detection in natural scenes and are not suitable for a specific task such as small-object detection in aerial scenes. As shown in Figure 4, object detection in natural and aerial scenes differs greatly. In a natural scene, objects are large, the background is simple, object features are obvious and the number of objects is small, while in an aerial image, objects are small, the background is complex, features are not obvious and the number of objects is large. Therefore, we need to set anchor boxes of appropriate size for our detection task.
In the object detection of natural scenes, the anchor parameters are set to a base size of 16, ratios of (2:1, 1:1, 1:2) and scales of (8, 16, 32). After a series of operations, nine anchors with three scales (128, 256, 512) and three ratios (2:1, 1:1, 1:2) are finally generated (see Table 1). As shown in Table 1, the anchor boxes generated in Faster R-CNN are too large for vehicle detection in aerial scenes. To choose an appropriate anchor-box size, YOLO9000 [15] proposes a k-means clustering approach: instead of manually selecting the anchor boxes, it obtains their sizes automatically by running k-means clustering on the bounding boxes of the training set, which significantly enhances the stability of the model and improves detection accuracy. However, this method requires the number of clusters k to be specified in advance, is very sensitive to the initialisation of the seed points, and also significantly increases the amount of calculation. The differential search algorithm [16] is a stochastic direct-search method with the advantage of being easily applied to experimental minimisation. Inspired by reference [17], we use a differential search algorithm to optimise the ratios and scales of the anchors, aiming to find the optimal three scales and three ratios that maximise the overlap between the vehicle bounding boxes and the anchor boxes. We set the base size of the anchors to 100. Through optimisation, we obtain the best three scales of 0.38, 0.52 and 0.68, and the three ratios of 1.62:1, 1:1 and 1:1.62.
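As a sketch, the optimised anchor parameters above (base size 100; scales 0.38, 0.52, 0.68; ratios 1.62:1, 1:1, 1:1.62) can be expanded into nine anchor shapes. The generation recipe below mirrors the common Faster R-CNN anchor-generation code and is our assumption, not the authors' implementation; widths and heights are rounded to integers.

```python
# Generate (width, height) pairs for all scale/ratio combinations,
# keeping the area of each anchor equal to (base * scale)^2.

def make_anchors(base=100, scales=(0.38, 0.52, 0.68),
                 ratios=(1.62, 1.0, 1 / 1.62)):
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:          # r is interpreted as height / width
            w = round((area / r) ** 0.5)
            h = round(w * r)
            anchors.append((w, h))
    return anchors

for w, h in make_anchors():
    print(f"{w} x {h}")
```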
After a series of operations, nine anchors of different sizes are produced (see Table 1), which basically cover the sizes of all types of vehicles. Compared with the anchor boxes in Faster R-CNN, our anchor boxes increase the mean average precision (mAP) by 9.1% on the Munich dataset and reduce the detection time by 17%. The reason is that fewer detections (fewer poorly aligned boxes with small confidence scores) are generated, so the non-maximum suppression (NMS) is faster.

Training stage
In the training process of the proposed network, we run 160k iterations on our model. In each iteration, the region proposal network produces two types of predictions: a binary classification and a bounding-box regression adjustment. The feature maps of all positive and negative sample regions are input into the loss function, and two output layers are obtained by iteration. The first layer outputs the probability score of an object vehicle in each predicted region, obtained by a softmax classifier. The second layer outputs the coordinate vector of each predicted region after bounding-box regression and fine-tuning. Similar to [13], the loss function is defined as shown in Equation (4):

L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p_i*) + λ (1 / N_bbr) Σ_i p_i* L_bbr(t_i, t_i*)    (4)

where i is the index of a region in the mini-batch and p_i is the predicted probability that region i contains a vehicle. p_i* is the ground-truth label: p_i* = 1 if the region box is positive and p_i* = 0 if it is negative. L_cls is a softmax loss over the vehicle and background classes. L_bbr operates on the coordinate vector loc = (x, y, w, h) of each predicted region, where x and y represent the top-left coordinates, and w and h denote the width and height. In each iteration, we process a training batch containing equal numbers of positive and negative region boxes. N_cls is the mini-batch size, and N_bbr is the number of anchor locations.
We set λ = 2 so that L_cls and L_bbr carry roughly equal weight. Moreover, L_bbr uses the smooth L1 loss, the same as in Fast R-CNN [12]. It is defined as Equation (5):

smooth_L1(x) = 0.5 x²,        if |x| < 1
               |x| − 0.5,     otherwise    (5)

To suppress redundant boxes, we use the NMS algorithm. In this algorithm, the suppression is an iterative traversal-and-elimination process whose purpose is to remove redundant detection boxes and retain the best one.
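The NMS step just described can be sketched in a few lines (a minimal greedy implementation for illustration; box format and the IoU threshold of 0.5 are our assumptions, not values from the paper):

```python
# Boxes are (x1, y1, x2, y2, score). Greedily keep the highest-scoring
# box, then discard any remaining box that overlaps it too much.

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_thresh=0.5):
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [b for b in boxes if iou(best, b) <= iou_thresh]
    return kept

dets = [(10, 10, 50, 50, 0.9), (12, 12, 52, 52, 0.8), (100, 100, 140, 140, 0.7)]
print(len(nms(dets)))  # 2: the second box heavily overlaps the first
```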

Training details
The proposed network is trained by stochastic gradient descent. To prevent over-fitting of the training model, we initialise the stem block with a PeleeNet model pre-trained on VOC 2012. In addition, the weights of the new layers are randomly initialised from a zero-mean Gaussian distribution with a standard deviation of 0.01. In the training process, we use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is set to 0.001 for the first 100k iterations and 0.0001 for the next 80k iterations. The RPN batch size is set to 256. We then generate approximately 300 overlapping candidate region boxes. To suppress redundant boxes, the NMS algorithm is applied to the proposed regions based on their confidence scores.
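The training hyperparameters above can be summarised in a small configuration sketch (values are taken from the text; the step-function form of the schedule is our own illustration):

```python
# Piecewise-constant learning-rate schedule: 0.001 for the first 100k
# iterations, then 0.0001 for the remaining iterations.

def learning_rate(iteration):
    return 0.001 if iteration < 100_000 else 0.0001

config = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 0.0001,
    "rpn_batch_size": 256,
}

print(learning_rate(50_000), learning_rate(150_000))  # 0.001 0.0001
```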

EXPERIMENTAL RESULTS
In this section, we present the results of our model in vehicle detection. Meanwhile, the experimental results are analysed in detail and compared with other state-of-the-art detection algorithms.

Dataset description
In the experiments, we use two aerial datasets to evaluate our model. The first is a public vehicle dataset collected over the city of Munich, Germany. The other is our own vehicle dataset, captured over the city of Nanjing, China. These two datasets contain a large number of high-resolution aerial vehicle images.
(1) Munich vehicle dataset: The Munich vehicle dataset is a public dataset that has been widely used by many researchers to evaluate the performance of aerial vehicle object detection [3][4][5]. The images were captured from an airplane by a Canon EOS 1Ds Mark III camera with a resolution of 5616 × 3744 pixels and a 50 mm focal length, and are stored in JPEG format. The Munich vehicle dataset was collected by Liu and Mattyus [2] and can be downloaded from [45].
To overcome the limitation of the dataset for training, each original aerial image (5616 × 3744 pixels) is cropped into 11 × 10 image blocks (702 × 624 pixels) with overlap. The data augmentation is divided into three steps. In the first step, we add Gaussian noise to each image block to double the original dataset. In the second step, we enlarge and reduce each image block by 10% to further expand the dataset. In the third step, we rotate the image blocks by four angles (i.e. 45°, 135°, 225° and 315°). Finally, the number of image blocks is expanded by 24 times.
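The tiling step can be sketched as follows (our own illustration: the block origins are spaced evenly so that an 11 × 10 grid of 702 × 624 blocks covers the 5616 × 3744 image with overlap; the exact overlap scheme used by the authors is not specified):

```python
# Compute evenly spaced block origins along one axis so that `count`
# blocks of size `block` exactly span `length`, overlapping as needed.

def block_origins(length, block, count):
    if count == 1:
        return [0]
    stride = (length - block) / (count - 1)
    return [round(i * stride) for i in range(count)]

xs = block_origins(5616, 702, 11)   # 11 columns of 702-px-wide blocks
ys = block_origins(3744, 624, 10)   # 10 rows of 624-px-tall blocks
blocks = [(x, y, x + 702, y + 624) for y in ys for x in xs]
print(len(blocks))  # 110 blocks per image
```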
(2) Our collected dataset: The collected vehicle dataset contains 615 high-resolution aerial images of 1368 × 770 pixels, captured by a UAV at a height of 60 m over the city of Nanjing, China; most of the images are aerial photographs of city roads. The vehicle information is annotated on each image. With approximately 30 car-type samples annotated in every image, the dataset contains about 18,450 samples with ground truth. We choose 80% of the data as training samples and the remaining data as testing samples. In the training stage, the original images are flipped vertically and horizontally and mirrored to expand the dataset and prevent over-fitting. We have uploaded the dataset to a public repository, and readers can download it from the link in [45].

Evaluation index
In our experiments, four typical indicators are used to evaluate detection performance, namely, precision rate, recall rate, mAP and F1-score.
The precision rate is defined with respect to the prediction results and indicates how many of the samples predicted to be positive are truly positive. The definition of the precision rate is shown in Equation (6):

Precision = TP / (TP + FP)    (6)

where true positive (TP) represents the number of positive samples predicted to be positive, and false positive (FP) represents the number of negative samples predicted to be positive. The recall rate is the ratio of the number of detected positive samples to the total number of positive samples, which is an important indicator of detection performance. It is defined as Equation (7):

Recall = TP / (TP + FN)    (7)

where false negative (FN) represents the number of positive samples predicted to be negative. Ideally, precision and recall would both be high at the same time, but in fact these two indicators are contradictory in some cases, and which one should be higher depends on the situation. Therefore an evaluation index that combines the recall rate and the precision rate is necessary. The F1-score is such an indicator, taking both the precision and the recall of the classification model into account. The definition of the F1-score is shown in Equation (8):

F1 = 2 × Precision × Recall / (Precision + Recall)    (8)

The mAP is a detection index proposed to overcome the single-point limitation of the precision rate, recall rate and F-measure, and it can comprehensively reflect global performance. The definition of the mAP is as follows (where P and R represent the precision rate and recall rate, respectively):

mAP = ∫₀¹ P(R) dR    (9)

The intersection-over-union (IoU) represents the ratio of the overlap between the predicted region and the ground-truth box. If the IoU of a predicted region with some ground-truth box is higher than 0.6, we assign it a positive label; if its IoU is lower than 0.3 for all ground-truth boxes, we assign it a negative label; the remaining regions are disregarded. The IoU ratio is defined as follows (where B_p and B_gt denote the predicted and ground-truth boxes):

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)
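The evaluation indices above can be computed directly from TP/FP/FN counts, as in this short sketch (the counts used here are illustrative values, not results from the paper):

```python
# Precision, recall and F1-score from confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

tp, fp, fn = 90, 10, 20          # illustrative counts only
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```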

Results on Munich vehicle dataset
In this study, we compare the results of our proposed method with those of the methods in the references. Table 2 shows the effects of the various design choices and components on performance. The stem block, the small convolution kernel and the transition layer with the same number of channels all contribute to improving detection accuracy and reducing the computational consumption of the model. The three-way dense layer and the appropriate anchor boxes contribute to improving detection accuracy.
Moreover, in our experiments, we find that although the appropriate anchor boxes do not reduce the computational consumption of the model, this design significantly reduces detection time. Appropriate anchor boxes accelerate the convergence of the model and allow it to locate detected vehicles quickly and accurately. After combining all these design choices and components, our proposed detection network achieves 85.8% mAP at 242 million MACs on the Munich vehicle dataset, which is state-of-the-art performance in terms of accuracy and computational cost.
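The paper does not spell out how the anchor sizes were derived beyond matching them to the typical dimensions of the object vehicles. One common way to obtain such anchors from a dataset, sketched below as an assumption on our part rather than the paper's actual procedure, is to run k-means on the ground-truth box (width, height) pairs:

```python
import random


def kmeans_anchors(wh_pairs, k=5, iters=50, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor sizes.

    Illustrative heuristic only; the paper may have chosen anchors
    from dataset statistics by hand instead.
    """
    random.seed(seed)
    centers = random.sample(wh_pairs, k)
    for _ in range(iters):
        # Assign each box to the nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for w, h in wh_pairs:
            j = min(range(k),
                    key=lambda c: (w - centers[c][0]) ** 2 + (h - centers[c][1]) ** 2)
            clusters[j].append((w, h))
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sorted(centers)
```

For aerial vehicles, which occupy a narrow range of pixel sizes, such clustering tends to produce a few tightly grouped anchors, consistent with the paper's observation that well-matched anchors speed up localisation.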
As we can see from Table 3, our method achieves compelling detection results on the Munich vehicle dataset. The mAP is 85.8%, which is 6.0% higher than that of PeleeNet, while the computational cost and model size of the proposed network are only about half those of PeleeNet. The accuracy is also higher than that of ResNet50 and DenseNet-121, and the computational cost of our detection network is only 6.3% of that of ResNet50 and 8.5% of that of DenseNet-121.
In Table 4, we compare our method with state-of-the-art aerial vehicle detection models in terms of detection accuracy and time. As shown in Table 4, our proposed method achieves state-of-the-art performance in terms of recall rate, precision rate, F1-score and detection time. Compared with Yu's method [6], our model shows superior performance in F1-score and mAP, while requiring only 65.2% of its detection time.
Our method achieves a recall rate of 81.7% on the Munich dataset, with a precision of 91.3%, an F1-score of 0.862, a mAP of 85.8% and a detection time of 1.78 s per image. These indicators have reached the leading level. In addition, our model has a training time of 0.17 s per iteration on the Munich dataset; over 160k iterations, the total training time is about 7.5 h.
NMS is an algorithm that suppresses non-maximum scores, and its threshold is an important parameter in object detection. After feature extraction and classification, each candidate window receives a score, but many windows contain or intersect other windows. NMS is therefore needed to select the candidate boxes with the highest score in each neighbourhood and suppress the windows with lower scores. Figure 5(a) compares our model with other state-of-the-art vehicle detection models on the mAP-NMS curve. Our model attains a higher mAP than the other models across the range of NMS thresholds; in particular, the mAP peaks when the NMS threshold is near 0.35, so we choose an NMS threshold of 0.35 at test time.

The precision-recall curve visually shows the recall and precision of a model over the sample population, which is a very important indicator of detection performance. Figure 5(b) shows the precision-recall curves of our proposed method and several other methods. It is not difficult to see from Figure 5(b) that our method has clear advantages over the other vehicle detection methods in precision-recall, which further illustrates its excellent performance in vehicle detection.

Figure 6 shows the results of our method on the Munich test dataset. The red boxes denote correct localisations, and the yellow boxes represent missed detections. In Figure 6, we select image blocks that are relatively difficult to detect to demonstrate the detection performance of our model: Figure 6(a) shows a scene with a high density of object vehicles and many vehicles to detect, Figure 6(b) shows vehicles against a complicated background, Figure 6(c) shows a scene where the object vehicles are occluded or in shadow, and Figure 6(d) is a local close-up of the detections in Figure 6(c).
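The suppression step described above can be sketched as greedy NMS. This is a generic illustration with our own function names, not the paper's code, using the 0.35 threshold selected from the mAP-NMS curve:

```python
def nms(boxes, scores, threshold=0.35):
    """Greedy non-maximum suppression.

    boxes: list of (x1, y1, x2, y2) tuples; scores: matching confidences.
    Returns the indices of the boxes that survive suppression.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box wins
        keep.append(best)
        # Drop every remaining box that overlaps the winner above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep
```

A lower threshold suppresses more aggressively (fewer duplicate boxes, but a higher risk of merging distinct nearby vehicles), which is why the mAP-NMS curve is used to pick the operating point.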
Figure 6(a) shows some missed samples at the boundary of the image block. This is because only part of the vehicle's information and features appears at the boundary, which makes detection more difficult. In fact, the image blocks in the test dataset partially overlap at the boundaries when the original image is segmented, so boundary samples are largely not missed once the detections of the image blocks are merged back into the original image. In Figure 6(b), a small number of missed samples appear in the shaded area. This is due to the influence of shadows on the extraction of object features, especially when the colour of the vehicle is darker or occlusion is also present. Across the four hard-to-detect image blocks shown in Figure 6, the overall detection accuracy reaches 96.1%, which shows that our proposed detection model has superior robustness and accuracy.

Results on our collected dataset
To further demonstrate the robustness of our method, we also evaluated the algorithm on the collected vehicle dataset. Our model spends 0.193 s per training iteration on the collected vehicle dataset; over 160k iterations, the total training time is about 8.6 h. Table 5 shows the numerical results of several methods. Our method reaches an F1-score of 0.923 on the collected vehicle dataset, with a recall rate of 91.4%, a precision rate of 93.2% and a detection time of 0.048 s per image. These indicators are significantly better than those of other state-of-the-art detection methods, which demonstrates that our method is more robust and accurate.
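The reported total training times follow directly from the per-iteration times and the 160k iterations; a quick arithmetic check (the Munich figure works out to roughly 7.6 h, which the text rounds to about 7.5 h):

```python
# Training time check: seconds per iteration x iterations, converted to hours.
ITERATIONS = 160_000
munich_hours = 0.17 * ITERATIONS / 3600      # Munich dataset
collected_hours = 0.193 * ITERATIONS / 3600  # our collected dataset
print(round(munich_hours, 1), round(collected_hours, 1))
```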
As shown in Figure 7, our model can successfully detect vehicles in a variety of complex backgrounds. In more detail, because the collected vehicle dataset has higher resolution than the Munich dataset, the vehicles detected on the collected dataset have higher scores, with most detected vehicles scoring above 0.95, which indicates that the detection results are more reliable.

FIGURE 7 Detection results on our collected dataset

DISCUSSION
In this study, compared with the lightweight network PeleeNet, the proposed network improves mAP by 6.0% on the Munich dataset, while the computational cost and model size decrease by approximately 50%. Compared with Yu's method, the most advanced aerial vehicle detection approach, the proposed method increases the recall rate and the precision rate by 1.1% and 0.9%, respectively, while requiring only 65.2% of its detection time. All of these results indicate that the proposed method has reached the most advanced level in the field of aerial vehicle detection.

Although our detection model achieves good performance on vehicle detection in aerial images, it still has some flaws. The first is hard-example detection: when some vehicles in an aerial image are partially occluded or in shadow, our model cannot detect them reliably. In addition, we only detect car-type vehicles in this study; the small number of vehicle types covered is another drawback of our model. In future work, we will pay attention to extracting the features of hard-example vehicles to improve their detection. Moreover, we will increase the number of vehicle categories so that the model can accurately detect different types of vehicles, which is also the focus of our future study.

CONCLUSIONS
In this study, we propose a fast and accurate aerial vehicle detection model. In our model, we establish a lightweight network to extract the features of the object vehicles, and we make several improvements to the feature extraction network for vehicle detection in aerial scenes. For example, transition layers with the same number of channels and small convolution kernels for object feature extraction are introduced into the feature extraction network. We combine the proposed feature extraction network with the object detection algorithm Faster R-CNN, which makes the algorithm better suited to detecting small objects in aerial scenes. In addition, we set appropriate anchor boxes according to the characteristics of the objects, which further improves the accuracy and efficiency of the model. Finally, we evaluate our proposed method on the Munich vehicle dataset and our collected dataset.