Multi-Scale Vehicle Detection in High-Resolution Aerial Images With Context Information

Recently, unmanned aerial vehicles (UAVs) have been widely used in many fields due to their low cost and high flexibility. One of the most popular applications of UAVs is vehicle detection in aerial images, which plays an important role in traffic surveillance and urban planning. Although many deep learning based detectors have achieved state-of-the-art (SOTA) performance on natural images, the significant variation in object scales caused by altitude changes of the UAV platform poses great challenges to these detectors for precise localization of vehicles in aerial images. To improve the detection performance for vehicles at different scales, we propose a novel detection algorithm which consists of three stages. In the first stage, to reduce the distortion of vehicles during image resizing and keep more information of aerial images, we utilize an image cropping strategy to divide the image into two patches. In the second stage, we combine the original image and the two patches into a batch and detect vehicles with a Convolutional Neural Network (CNN). For feature representation in our detector, we propose Scale-specific Prediction to strengthen the multi-scale features of vehicles with context information. In the final stage, to fuse detections and suppress false alarms, we propose an Outlier-Aware Non-Maximum Suppression. Extensive experiments demonstrate the superiority of the proposed algorithm by comparison with other SOTA solutions.


I. INTRODUCTION
The usage of UAVs has been increasing rapidly in recent years in plant protection [1], [2], traffic surveillance [3]-[5], disaster rescue [6], and urban planning [7]. One of the popular topics in the field of object detection in UAV images is vehicle detection [8]-[12]. The distribution of vehicles in different districts can provide essential information for traffic supervision and urban planning. However, aerial images may be taken at different altitudes, making vehicles appear at different scales in different images, as shown in Figure 1. For images captured from high-altitude UAVs, the ground sampling distance (GSD) is usually higher than 0.1 meter and the size of a standard vehicle is typically 48 × 16 pixels [7]. For images captured from low-altitude UAVs, the GSD may reach the centimeter level and the typical size of a vehicle is 180 × 80 pixels [13]. Therefore, learning an effective representation of vehicles at different scales is an essential challenge for vehicle detection in aerial images.

(The associate editor coordinating the review of this manuscript and approving it for publication was Orazio Gambino.)
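The pixel sizes cited above follow directly from the GSD; a quick sanity check is shown below, where the physical vehicle dimensions (4.8 m × 1.6 m for a standard car) are our own assumption chosen to match the cited pixel sizes, not values from the paper.

```python
def vehicle_size_px(length_m, width_m, gsd_m):
    """Convert a vehicle's physical footprint to pixels at a given GSD
    (ground sampling distance, meters per pixel)."""
    return round(length_m / gsd_m), round(width_m / gsd_m)

# High-altitude imagery with a GSD of 0.1 m/pixel:
print(vehicle_size_px(4.8, 1.6, 0.1))     # (48, 16), matching the size cited from [7]
# Low-altitude imagery with a centimeter-level GSD of ~0.027 m/pixel:
print(vehicle_size_px(4.8, 2.16, 0.027))  # close to the 180 x 80 size cited from [13]
```

The ratio of the two GSDs (roughly 4×) explains the scale gap the detector has to bridge between high- and low-altitude images.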
Recently, deep-learning based detection models have achieved SOTA performance on generic object detection in natural images [14]. There are two main frameworks for detection tasks: two-stage detectors and one-stage detectors. Regions with CNN features (RCNN) [15] and its variants [16]-[19] are typical two-stage detectors, which first propose potential object regions, followed by bounding box regression and object classification. In contrast, YOLO (You Only Look Once) [20]-[22] and SSD (Single Shot Multibox Detector) [23] are typical one-stage detectors, which embed region proposal and object classification in an end-to-end framework. However, these SOTA detectors may not achieve the same performance on UAV images due to their inherent differences from natural images. Furthermore, vehicles are usually arbitrarily distributed in aerial images, which makes it difficult for these detectors to cover vehicles and results in the IoU (intersection over union) distribution imbalance problem [24]. In addition, there are many small vehicles in UAV images [25], which makes it necessary to leverage more details of small vehicles in high resolution UAV images.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To tackle the problems in vehicle detection in highresolution UAV images, we propose a novel algorithm which mainly consists of three steps. The framework of our approach is shown in Figure 2. First, we utilize an adaptive image cropping strategy for images with aspect ratio deviating from 1. However, some vehicles may be divided into two or more parts. Therefore, we conduct detection on both the original image and the image slices. To strengthen the feature representation of multi-scale vehicles, we propose scale-specific prediction modules. After detecting on these images, there are many redundancies which have to be removed. Meanwhile, there are many false alarms while some of them are with extreme scales or aspect ratios. We consider these false alarms as outliers and propose OA-NMS to remove them. The main contribution of our work is as follows: 1) First, we analyse the IoU distribution imbalance problem in vehicle detection in high resolution aerial images and propose a novel adaptive framework to utilize more information of aerial images.
2) Second, we analyse the difficulty in feature representation of multi-scale vehicles in high resolution aerial images and propose a novel scale-specific prediction based single shot detector (SSP-SSD) to include more context information to strengthen the feature representation of multi-scale vehicles.
3) Third, we propose a novel Outlier-Aware Non-Maximum Suppression (OA-NMS) to fuse results and remove false detections.
The paper is organized as follows: in Section II, we briefly review the related works on vehicle detection in UAV images. In Section III, we elaborate on the proposed approach. In Section IV, experiments are conducted to verify the performance of our method by comparison with other SOTA models. Finally, we conclude our work in Section V.

II. RELATED WORK
A. TRADITIONAL VEHICLE DETECTION
Traditional vehicle detection algorithms are mainly based on handcrafted descriptors including Local Binary Patterns (LBP), Haar-like features, the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG) [13], [26], [27]. Kembhavi et al. [7] combined color probability maps of vehicles with surroundings, structural characteristics of objects called pairs of pixels, and HOG features to represent vehicles. Zhou et al. [28] utilized local steering kernel features to describe vehicles and applied a sliding window strategy for vehicle detection in arbitrary orientations, which is very time-consuming. Moranduzzo and Melgani [29] detected vehicles in UAV images with different descriptors. They observed that the integration of the original SIFT features with color and morphological features had the best performance. A combination of diverse features including HOG, LBP and opponent histograms was proposed in [30] to detect cars in aerial images. Chen et al. [8] proposed a multi-order feature descriptor which contains color, texture and high-order context information to describe vehicles. However, the background of aerial images is usually very complex, and parts of other objects (e.g., building roofs, road marks, trash bins and electrical units) may appear similar to vehicles in some circumstances. These factors make it difficult for handcrafted descriptors to discriminate vehicles from the background.

B. DEEP LEARNING BASED VEHICLE DETECTION
With the development of deep learning, many CNN based algorithms have been proposed to strengthen the feature representation of vehicles in UAV images. Audebert et al. [31] proposed a segment-before-detect algorithm to detect vehicles in very high resolution (VHR) UAV images. They utilized SegNet [32] to obtain a pixel-level semantic map, followed by a CNN to classify instances. However, the semantic segmentation costs a lot of time and leads to low efficiency. Tao et al. [33] proposed a similar framework and added an additional post-processing step to improve the recall rate, which is even more time-consuming. Ammour et al. [9] segmented the image into small homogeneous superpixels for the prediction of vehicle candidate locations. They extracted image features with a pre-trained CNN and applied a linear support vector machine (SVM) for region classification. This algorithm cannot effectively deal with the scale change of vehicles. When each vehicle is completely segmented into a single superpixel, it can achieve good detection results. However, if the vehicles are so small that many vehicles fall in one superpixel, or so large that one vehicle is divided into many superpixels, it is difficult to achieve good results. Sommer et al. [34] modified the region proposal module of Faster RCNN (FRCNN) [17] according to the characteristics of UAV images. Koga et al. [35] utilized a hybrid deep CNN for vehicle detection and proposed a hard example mining algorithm to select the most informative samples for training. Li et al. [11] developed a unified framework for simultaneous vehicle detection and counting. They employed a scale-adaptive anchor generator and a feature extraction module to maximize the mutual information between vehicle classes and features. These algorithms utilize the feature pyramid to construct multi-scale feature representations for vehicles at different scales.
In the feature pyramid, low-level features were used to represent small objects while high-level features were used to represent large objects. However, for small vehicles, low-level features can hardly provide enough information, which makes context information important for object representation. In addition, a large receptive field may introduce unnecessary background information, which inevitably degrades the feature learning performance for small vehicles [36]. Therefore, we propose SSP-SSD and utilize context information to strengthen the feature representation of multi-scale vehicles.
In addition, to keep more details of small vehicles in high resolution UAV images, some cropping methods have been proposed [25], [37], [38]. Ünel et al. [25] simply cropped the images into many pieces and conducted detection on both the original image and the image slices. George et al. [37] proposed a selective tiling method to select the slices which may contain objects. Kouris et al. [38] proposed a density map prediction method to guide image cropping. These methods aim at achieving higher accuracy at the expense of time efficiency: as the number of slices increases, they become increasingly time-consuming. In addition, these methods adopt NMS [39] to fuse results. However, besides redundant detections, there are many false alarms. Kouris et al. [40] investigated the relation between vehicle scale and UAV altitude and utilized it to remove false positives. However, this algorithm requires additional information, e.g., the UAV altitude at capture time, and assumes that the image covers a flat area, which does not hold for surfaces with varying elevation and changing UAV heights.
Therefore, to strike a balance between accuracy and time efficiency, we propose a simple but efficient image preprocessing strategy. In addition, we propose an effective OA-NMS to remove redundancies and false detections, which does not need extra information.

III. METHODOLOGY
The framework of the proposed algorithm is illustrated in Figure 2. Our framework can be divided into three parts: Image Preprocessing (IP), Scale-specific Prediction (SSP) and OA-NMS. In this section, we describe these parts in detail.

A. IMAGE PREPROCESSING
As shown in Figure 3, image scaling may result in vehicle distortion. For a one-stage detector, pre-defined anchors are utilized to cover vehicle candidate regions. Normally, IoU [41] is used to determine the overlap between an anchor bb_i^an and the ground truth bb_j^*, which is defined in Eq.(1):

IoU(bb_i^an, bb_j^*) = area(bb_i^an ∩ bb_j^*) / area(bb_i^an ∪ bb_j^*)    (1)

where area(bb_i^an ∩ bb_j^*) denotes the intersection area between the anchor and the ground truth, while area(bb_i^an ∪ bb_j^*) represents the area of their union.
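The IoU of Eq.(1) reduces to a few lines of code for axis-aligned boxes; a minimal sketch, using the common (x1, y1, x2, y2) corner convention:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)     # clamp to 0 when disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                    # inclusion-exclusion
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0  (identical boxes)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7  (intersection 1, union 7)
```

The same helper is reused later during anchor matching and result fusion.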
If the IoU is larger than a predefined threshold, the region covered by the anchor box bb_i^an can be selected as a vehicle candidate region. Therefore, the design of anchors with predefined scales and aspect ratios has a great influence on the detection performance. However, when a UAV image is scaled to fit the input size of the detector, the aspect ratios of vehicles also change, as illustrated in Figure 3. For instance, the bus in the original image has an aspect ratio of 48 : 155, which approximately equals 1 : 3.23. After image scaling, the aspect ratio of the bus becomes 1 : 5.74. Therefore, to fully cover such vehicles, we would have to define a large number of extra anchors, most of which are false candidates and may result in false alarms. Additionally, due to the arbitrary distribution of vehicles, the predefined anchors have low IoU with a large number of ground truth boxes. These two factors skew the IoU distribution towards lower IoUs and cause the IoU distribution imbalance problem [24]. Therefore, to reduce the IoU distribution imbalance problem and keep more information, we propose the adaptive image preprocessing method shown in Figure 2. We define H_I and W_I as the height and width of the original image, and S_h and S_w as the height and width of the image slices, respectively. We calculate S_h and S_w according to Algorithm 1. To save computation, we limit the number of image slices to a maximum of two. However, if S_w or S_h were simply assigned W_I/2 or H_I/2 respectively, some small vehicles might be divided into two parts and become difficult to detect even in the original image. To solve this problem, we add a margin δ to S_w or S_h to ensure that small vehicles remain intact in at least one slice, as shown in Figure 2. For vehicles in 4K images, δ has little influence on the aspect ratio of vehicles in a slice.
For instance, after image cropping, the aspect ratio of the aforementioned bus after scaling is approximately 1 : 3.55 when ξ equals 0.1. Therefore, we adopt a uniform cropping strategy for all images.
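The cropping rule above can be sketched as follows. The paper's Algorithm 1 pseudocode is not reproduced here, so this is only our reading of the text: we assume the margin δ is taken as ξ times the length of the cropped side (with ξ = 0.1), and that the split always runs along the longer side.

```python
def crop_slices(h, w, xi=0.1):
    """Split an image whose aspect ratio deviates from 1 into two
    overlapping slices along its longer side (at most two slices).
    Returns a list of (x1, y1, x2, y2) slice windows."""
    if w >= h:
        delta = int(xi * w)        # margin so small vehicles stay intact in one slice
        s_w = w // 2 + delta
        return [(0, 0, s_w, h), (w - s_w, 0, w, h)]
    else:
        delta = int(xi * h)
        s_h = h // 2 + delta
        return [(0, 0, w, s_h), (0, h - s_h, w, h)]

# A 4K frame (3840 x 2160) yields two 2304-wide slices with a 768-pixel overlap:
print(crop_slices(2160, 3840))
```

Both slices plus the original image are then batched together for detection, as described in the next subsection.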

B. SCALE-SPECIFIC PREDICTION
After image preprocessing, we apply a CNN to detect vehicles in the original image and the image slices. The detection network is built on SSD and its architecture is demonstrated in Figure 2. We utilize ResNet101 [42] as the backbone for feature extraction.

Algorithm 1: Image Cropping Algorithm

The original prediction layer of SSD obtains the object's category and location directly through regression, as shown in Figure 4 (a). However, due to the lack of context information, low-level features often generate missed and false detections. Therefore, we propose SSP for vehicles with different scales.
First, we utilize a feature pyramid to construct different levels of features. For small vehicles, we utilize low-level features, which do not contain sufficient semantic information. Therefore, we add additional branches to include more context information, as shown in Figure 4 (b). We utilize a 3 × 3 filter to include context information while avoiding the introduction of too much background information. To save computation expense, we utilize a combination of 1 × 3 and 3 × 1 filters instead of one 3 × 3 filter, following [43]. Furthermore, to strengthen the features, a skip connection is adopted to fuse lower-level and higher-level features. For large vehicles, we add an additional 5 × 5 filter to include more context information; here we use two additional 3 × 3 filters instead of one 5 × 5 filter to save computation expense, as shown in Figure 4 (c). Before these filters with large receptive fields, we leverage 1 × 1 filters as transition layers to reduce the input channels and the computation burden.
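The receptive-field argument behind these factorizations can be verified by composing kernels: stacking two linear convolutions (with no nonlinearity in between) is itself a convolution whose kernel is the full convolution of the two kernels. The sketch below is illustrative only and ignores the nonlinearities the actual network inserts between layers.

```python
def compose_kernels(k1, k2):
    """Effective kernel of applying k1 then k2 (full 2-D convolution),
    with kernels given as nested lists."""
    h1, w1 = len(k1), len(k1[0])
    h2, w2 = len(k2), len(k2[0])
    out = [[0.0] * (w1 + w2 - 1) for _ in range(h1 + h2 - 1)]
    for i in range(h1):
        for j in range(w1):
            for a in range(h2):
                for b in range(w2):
                    out[i + a][j + b] += k1[i][j] * k2[a][b]
    return out

ones = lambda h, w: [[1.0] * w for _ in range(h)]

# Two stacked 3x3 filters cover the same 5x5 area as one 5x5 filter,
# but with 2 * 9 = 18 weights instead of 25:
eff = compose_kernels(ones(3, 3), ones(3, 3))
print(len(eff), len(eff[0]))   # 5 5
# A 1x3 followed by a 3x1 spans the same 3x3 area as a single 3x3 filter,
# with 6 weights instead of 9:
sep = compose_kernels(ones(1, 3), ones(3, 1))
print(len(sep), len(sep[0]))   # 3 3
```

This parameter saving is exactly why [43] favors the factorized forms in the SSP branches.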
The anchors of the prediction layers are defined on the multi-scale feature maps. During the training process, if the i-th anchor has the highest IoU, larger than 0.5, with the j-th ground truth box bb_j^*, we consider the anchor positive and assign the label y_i = 1 and m_{i,j} = 1. Meanwhile, it is appended to the positive set A+. If an anchor's IoU with all ground truth boxes is lower than 0.2, it is regarded as negative and appended to the negative set A−. We select the 3M negative anchors with the highest probability into the training set. Based on the above definitions, the loss function is defined as:

L = (1/M) (L_cls + λ L_loc)

where M is the number of matched anchors (if M = 0, the loss is set to 0) and {b_x, b_y, b_w, b_h} denotes the offset between an anchor and the corresponding predicted box. The classification loss L_cls is the cross-entropy loss [44] and is defined as follows:

L_cls = − Σ_{i∈A+} log p(x_i) − Σ_{k∈A−} log p(x_k)

where p(x_i) represents the probability of an anchor belonging to a vehicle, while p(x_k) denotes the probability of an anchor belonging to the background.
The localization loss L_loc is a smooth L1 loss [16] between the predicted box and the ground truth, defined as follows:

L_loc = Σ_{i∈A+} Σ_{m∈{x,y,w,h}} m_{i,j} · smoothL1( b_i^m − b_j^{*m} )

where {b_j^{*x}, b_j^{*y}, b_j^{*w}, b_j^{*h}} is the offset between a ground truth and the corresponding anchor bb_i^an. We utilize stochastic gradient descent (SGD) to optimize the detection network during training. In the inference stage, a UAV image and its slices are fed into the network for vehicle localization.
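The smooth L1 term above has the standard piecewise form [16]: quadratic near zero, linear in the tails, so that large regression errors do not dominate the gradient. A minimal sketch:

```python
def smooth_l1(x):
    """Smooth L1 loss: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise.
    Quadratic near zero, linear in the tails, so outlier offsets
    contribute a bounded gradient."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

print(smooth_l1(0.5))   # 0.125  (quadratic region)
print(smooth_l1(2.0))   # 1.5    (linear region)
print(smooth_l1(-2.0))  # 1.5    (symmetric)
```

In the loss above, this function is applied per coordinate offset (x, y, w, h) of each matched anchor.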

C. OUTLIER-AWARE NMS
After detecting vehicles on both the original image and the image slices, there are many redundant detections, which makes it necessary to merge all detections belonging to the same vehicle into a single detection. The most popular methods are NMS [39] and SoftNMS [45]. NMS starts with the detection set BB with scores S and adopts Eq.(6) to re-score the detections:

s_i = s_i, if IoU(bb_hs^re, bb_i^re) < N_T;  s_i = 0, otherwise    (6)

First, it selects the detection with the highest score, bb_hs^re. After removing it from the set BB, NMS appends it to the final detection set D and removes from BB any results whose IoU with bb_hs^re is greater than a threshold N_T. This process is repeated until the set BB is empty. However, NMS applies a hard threshold when re-scoring boxes, which means that if two true positives overlap by more than N_T, the one with the lower score is deleted. To solve this problem, Soft-NMS utilizes a Gaussian penalty function, Eq.(7), to assign a lower score to a box bb_j^re that has a high overlap with bb_hs^re:

s_j = s_j · exp( − IoU(bb_hs^re, bb_j^re)² / σ )    (7)
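The two re-scoring rules of Eq.(6) and Eq.(7) can be sketched in one greedy loop. Detections are represented as (score, box) pairs, and the small `iou` helper and the score cutoff `score_thresh` are our own assumptions for the sketch.

```python
import math

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(dets, n_t=0.5, soft=False, sigma=0.5, score_thresh=0.001):
    """Greedy NMS with a hard threshold (Eq.(6)) or Soft-NMS with a
    Gaussian penalty (Eq.(7)). dets: list of (score, (x1, y1, x2, y2))."""
    dets = sorted(dets, reverse=True)       # highest score first
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        rescored = []
        for s, box in dets:
            o = iou(best[1], box)
            if soft:
                s *= math.exp(-o * o / sigma)   # Gaussian decay, Eq.(7)
            elif o >= n_t:
                s = 0.0                          # hard suppression, Eq.(6)
            if s > score_thresh:
                rescored.append((s, box))
        dets = sorted(rescored, reverse=True)
    return keep

dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (100, 100, 110, 110))]
print(len(nms(dets)))            # 2: the heavily overlapping box is suppressed
print(len(nms(dets, soft=True))) # 3: Soft-NMS only lowers its score
```

The contrast in the example is the motivation given above: the hard rule deletes an overlapping true positive outright, while the soft rule merely down-weights it.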
However, in aerial images, parts of other objects that appear similar to vehicles may also be mistaken as positive samples. For example, Figure 5 shows that some false alarms with extreme scales or aspect ratios have low scores compared with true positives in the same image. The area ratio AR_v of a vehicle v in image I is defined in Eq.(9):

AR_v = area(v) / area(I)    (9)
where area(v) denotes the area of vehicle v and area(I) denotes the area of the image I which contains v. Therefore, we consider these false alarms as outliers and propose the OA-NMS method to remove them, using the outlier-aware re-scoring function defined in Eq.(10). In OA-NMS, we utilize γ̄ and ξ̄ to represent the average area and average aspect ratio of the bounding boxes in one image; γ̄ and ξ̄ take different values in different images.
The penalty function is continuous, and Eq.(13) gives its derivative. It can be seen that when x > 1, ∂f(x, η)/∂x < 0, and when x ∈ (0, 1), ∂f(x, η)/∂x > 0. As shown in Figure 6, the penalty function attains its maximum at x = 1, and there is a high penalty when x deviates far from 1. Therefore, if a detection has an extreme size or aspect ratio compared with the other detections in the same image, its f(γ/γ̄, α) or f(ξ/ξ̄, β) takes a small value and the detection will be removed. However, according to Eq.(10), if a detection has a score higher than 5N_t/3, it can be considered a true positive and will not be removed. Algorithm 2 gives the pseudo code of the OA-NMS algorithm.
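Since Eq.(10) and the exact form of f are not reproduced above, the sketch below uses an illustrative penalty f(x, η) = (x·e^(1−x))^η, chosen only because it has exactly the properties stated for Eq.(13): it is maximal at x = 1, with ∂f/∂x > 0 on (0, 1) and ∂f/∂x < 0 for x > 1. The paper's actual f may differ; only the outlier-suppression behavior is demonstrated.

```python
import math

def penalty(x, eta):
    """Illustrative outlier penalty (an assumed form, not the paper's Eq.):
    equals 1 at x = 1 and decays as x deviates from 1 in either direction,
    matching the derivative signs stated for Eq.(13)."""
    return (x * math.exp(1.0 - x)) ** eta

def outlier_weight(area, aspect, mean_area, mean_aspect, alpha, beta):
    """Down-weight a detection whose area or aspect ratio deviates strongly
    from the per-image averages (gamma-bar, xi-bar)."""
    return penalty(area / mean_area, alpha) * penalty(aspect / mean_aspect, beta)

# A detection matching the per-image averages keeps its full weight...
print(outlier_weight(0.05, 3.0, 0.05, 3.0, 0.5, 0.5))        # 1.0
# ...while one with a 10x larger area is heavily penalized:
print(outlier_weight(0.50, 3.0, 0.05, 3.0, 0.5, 0.5) < 0.2)  # True
```

In OA-NMS this weight would multiply the detection score before thresholding, so outliers with already-low scores drop below the keep threshold while high-scoring true positives survive.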

IV. EXPERIMENTS
In this section, we conduct several experiments to demonstrate the effectiveness of our algorithm on our captured dataset and on UAVDT [46]. We utilized a DJI Matrice 100 (M100) platform to obtain the images for our dataset. All experiments are conducted on a desktop computer with an Intel Core i7-4790K CPU, 32 GB memory and an NVIDIA Titan X GPU (with 12 GB graphics memory). The operating system is Windows 7.

A. DATASETS 1) OUR DATASET
Our dataset contains 7760 aerial images; 4355 of them are used as the training set while the rest form the testing set. There are 312071 vehicles in total, 171556 of which are included in the training set and the rest in the testing set. In the COCO dataset [14], a small object is one whose size is smaller than 32 × 32 pixels, a medium object is between 32 × 32 and 96 × 96, and a large object is larger than 96 × 96. A typical resolution of an image in the COCO dataset is 640 × 427, whereas the resolution of most UAV images is 3840 × 2160. Therefore, it is not suitable to apply the COCO size definition to our data. The scale distribution of vehicles in our dataset is illustrated in Figure 7. The definition of the area ratio is given in Eq.(9). We utilize the area ratio of a target to classify objects into three categories: small, medium and large. We consider vehicles with area ratio in (0, 0.06%] as small vehicles and vehicles with area ratio in (0.06%, 0.1%] as medium vehicles, while the area ratio of large vehicles is in (0.1%, 1].
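The area-ratio thresholds above translate into a simple classifier; the sketch below also checks that a typical high-altitude vehicle (48 × 16 pixels in a 4K frame) falls into the small category.

```python
def size_category(area_ratio):
    """Classify a vehicle by its area ratio AR_v (Eq.(9)) using the
    dataset-specific thresholds instead of COCO's absolute pixel sizes."""
    if area_ratio <= 0.0006:       # (0, 0.06%]
        return "small"
    elif area_ratio <= 0.001:      # (0.06%, 0.1%]
        return "medium"
    return "large"                 # (0.1%, 1]

# A 48 x 16 vehicle in a 3840 x 2160 image has AR_v of roughly 0.0093%:
print(size_category((48 * 16) / (3840 * 2160)))   # small
```

Using a ratio rather than absolute pixel counts keeps the categories comparable across images captured at different resolutions and altitudes.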

2) UAVDT DATASET
The UAVDT [46] dataset contains 40409 images; 23829 of them are used as the training set while the rest form the testing set. The resolution of the images is about 1080 × 540 pixels. The categories in UAVDT are car, bus and truck. The quantity distribution of the different categories is shown in Figure 8.

B. EVALUATION METRICS 1) OUR DATASET
Popular metrics including precision [41], recall [41], F1-Score [47] and AP (average precision) are used for performance evaluation. AP refers to the area under the precision-recall curve. The other metrics are defined in Eq.(14)-(16):

Precision = TP / (TP + FP)    (14)
Recall = TP / (TP + FN)    (15)
F1-Score = 2 · Precision · Recall / (Precision + Recall)    (16)

where TP denotes true positives, i.e., the number of positive detections that are really positive, FN denotes false negatives, i.e., the number of vehicles that are not detected, and FP denotes false positives, i.e., the number of false alarms.
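Eq.(14)-(16) can be computed directly from the three counts; a minimal sketch with guards for empty denominators:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1-Score from detection counts (Eq.(14)-(16))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 80 correct detections, 20 false alarms, 20 missed vehicles:
p, r, f1 = detection_metrics(tp=80, fp=20, fn=20)
print(p, r, f1)   # 0.8 0.8 0.8
```

F1 is the harmonic mean of precision and recall, which is why the sensitivity analysis below uses it to balance the two.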

2) UAVDT DATASET
Mean average precision (mAP) is utilized to evaluate the performance of each algorithm on UAVDT. It is the average of the APs over the different categories. The mAP is highly dependent on the IoU threshold. Therefore, mAP50:95, mAP50 and mAP75 are utilized as the metrics. Specifically, mAP50 and mAP75 refer to mAP at IoU = 0.5 and IoU = 0.75 respectively, while mAP50:95 averages the mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05.
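The mAP50:95 averaging convention can be sketched as below; the per-threshold APs would come from a full precision-recall evaluation at each IoU threshold, so the dictionary here is a toy stand-in, not real results.

```python
def map_50_95(ap_at_iou):
    """Average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95
    (the COCO-style mAP50:95 convention). ap_at_iou maps threshold -> AP."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_at_iou[t] for t in thresholds) / len(thresholds)

# Toy example: AP degrades as the IoU requirement tightens.
aps = {round(0.5 + 0.05 * i, 2): 0.9 - 0.05 * i for i in range(10)}
print(map_50_95(aps))   # mean of 0.90 ... 0.45 = 0.675
```

mAP50 and mAP75 are simply the entries at thresholds 0.5 and 0.75 of the same table.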

C. SENSITIVITY ANALYSIS
OA-NMS has four hyper-parameters: σ, α, β and a score threshold N_t. We vary these parameters and measure recall, precision and F1-Score on the training set. The IoU threshold between candidate detections and ground truth is set to 0.5, meaning a candidate detection counts as a true positive only if its overlap is higher than 0.5. First of all, we set ξ to 0.1 so that it reduces both the distortion of objects and the repeated detections caused by image cropping. To find a suitable parameter set, we vary these parameters from 0.05 to 0.95 and apply grid search to obtain the values listed in Table 1.
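The grid search over σ, α, β and N_t can be organized as below. `evaluate_f1` stands in for running OA-NMS on the training set and computing F1, and is a hypothetical callback; the toy objective at the end is only there to exercise the loop.

```python
from itertools import product

def grid_search(evaluate_f1, step=0.05, lo=0.05, hi=0.95):
    """Exhaustively try (sigma, alpha, beta, n_t) combinations on a grid
    from lo to hi and keep the set with the best F1-Score."""
    n = int(round((hi - lo) / step)) + 1
    grid = [round(lo + step * i, 2) for i in range(n)]
    best_params, best_f1 = None, -1.0
    for sigma, alpha, beta, n_t in product(grid, repeat=4):
        f1 = evaluate_f1(sigma, alpha, beta, n_t)
        if f1 > best_f1:
            best_params, best_f1 = (sigma, alpha, beta, n_t), f1
    return best_params, best_f1

# Toy objective peaking at sigma=0.5, alpha=0.45, beta=0.45, n_t=0.3:
toy = lambda s, a, b, t: -((s - 0.5) ** 2 + (a - 0.45) ** 2
                           + (b - 0.45) ** 2 + (t - 0.3) ** 2)
print(grid_search(toy)[0])   # (0.5, 0.45, 0.45, 0.3)
```

With a 0.05 step the grid has 19 values per parameter, i.e. 19^4 ≈ 130k evaluations, which is why the paper restricts the search to the training set.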
To demonstrate the relation between these parameters and the metrics, we first set α and β to the values given in Table 1. Under these circumstances, we obtain the influence of σ and N_t on the model performance, as demonstrated in Figure 9. It can be seen that the smaller N_t is, the higher the recall, meaning that we obtain more true detections at lower values of N_t. Meanwhile, the smaller N_t is, the lower the precision: as N_t increases, the false alarms tend to decrease. According to Eq.(8) and Figure 9, when σ increases, the OA-NMS scores become less sensitive to the IoU scores, which results in a higher recall rate and a lower precision rate. In addition, we assign the values of N_t and σ according to Table 1 and vary α and β from 0.05 to 0.95 to obtain Figure 10. It can be seen from Figures 6 and 10 that the relationship between the performance and the two parameters α and β is similar to that with σ.
SoftNMS has a hyper-parameter σ, and its sensitivity on recall, precision and F1-Score is shown in Figure 12 (a). It can be seen that the recall is stable as σ increases, while precision drops significantly in [0.3, 0.45]. This is mainly because SoftNMS becomes less sensitive to IoU when σ increases, so more redundancies are retained. The sensitivity of NMS against the variation of the overlap threshold N_T is shown in Figure 12 (b). As N_T increases, recall also increases while precision decreases.
To achieve a trade-off between recall and precision, we set parameters shown in Table 1.

D. ABLATION STUDY 1) SSP EVALUATION
We propose SSP to construct effective representations for multi-scale vehicles. In Figure 11 and Table 2, we compare detection results of SSP-SSD with those of the original SSD. It can be seen that SSP significantly improves the detection performance for small vehicles. Meanwhile, thanks to SSP, false alarms also decrease, which means the context information is able to enhance the feature representation of objects. The detection performance on large vehicles is also improved by SSP, but not as markedly as for small vehicles. The main reason is that the detector utilizes high-level features to represent large objects; these already describe the objects better than low-level features and weaken the contribution of context information.

2) EVALUATION OF DIFFERENT FUSION ALGORITHMS
We report the performance of the different methods in Table 3 when the overlap threshold is 0.5. It can be seen that the image preprocessing significantly improves recall, which means it is effective in dealing with the IoU distribution imbalance problem. However, it also introduces many false detections. During the result fusion stage, SoftNMS achieves the highest recall with the lowest precision among the three algorithms, mainly because it aims to keep more detections with high overlap while the intersection between the bounding boxes of most vehicles is relatively small in aerial images. In contrast, OA-NMS obtains the highest precision due to the removal of outliers. In terms of F1-Score, OA-NMS achieves the best performance.

E. PERFORMANCE EVALUATION 1) EXPERIMENTS ON OUR DATASET
In this section, we compare our algorithm with other SOTA methods including SSD [23], Cascade RCNN [19], FRCNN [17], YOLOV3 [22], YOLOV4 [51], YOLOV5(x) [52], FCOS [49], RetinaNet [50] and CenterNet [48]. ResNet101 is selected as the backbone for most of these networks, except that YOLOV3 adopts Darknet53 [22] while YOLOV4 and YOLOV5 adopt CSPDarknet [51]. We analyze the performance of each model using different overlap thresholds. The results demonstrated in Figure 14 indicate that our method obtains the highest recall, precision and F1-Score. In Table 4, we list the performance of each method when the overlap threshold equals 0.5. It can be seen that our method achieves the best performance in terms of AP and F1-Score. As for computation efficiency, it can be seen from Table 3 that SSP-SSD takes 79 ms to detect on the original image. Detecting the original image and the two cropped patches serially would therefore take much more time. However, owing to parallel computation, our method takes only 101 ms per image, which saves much time compared with serial computation.
In addition, we report the detection performance of each model on small, medium and large vehicles respectively, when the overlap threshold is 0.5 (as shown in Table 5). For medium and large vehicles, these algorithms achieve SOTA performance, except CenterNet, which mainly relies on a single-scale heatmap. However, for small vehicles, the low-level features extracted by SSD lack context information, so they cannot effectively represent small objects. Although YOLOV3 introduces high-level semantic information for the feature representation of small vehicles, it may include too much background information. Owing to its cascade structure, Cascade RCNN is able to learn high quality features for multi-scale vehicles, which makes it perform better than FRCNN. It can be seen that our method achieves SOTA performance on multi-scale vehicles.
Finally, we report the performance of the other methods when utilizing OA-NMS to remove redundant detections (as shown in Table 4). It can be seen that OA-NMS improves the precision of these methods while leading to a decrease in recall. This is mainly because OA-NMS is a re-scoring method which is sensitive to outliers with extreme scales or aspect ratios, so it may remove some true detections with low scores while removing false detections. If there are a large number of outliers, OA-NMS greatly improves performance in terms of F1-Score and AP, as for CenterNet and YOLOV4. However, if the number of outliers is small, it does not yield an obvious improvement in F1-Score and AP, as for Faster RCNN, Cascade RCNN and SSD. The F1-Score and AP may even decrease slightly, as for FCOS, RetinaNet, YOLOV3 and YOLOV5.

2) EXPERIMENTS ON UAVDT DATASET
We compared our method with other SOTA methods on the UAVDT dataset, including R-FCN [18], RON [53], SSD [23], Faster RCNN (FRCNN) [17], ClusDet [54] and DMNet [38]. All experimental results of the other SOTA methods are taken from [38], and the detection performance is illustrated in Table 6. It can be seen that our method outperforms the SOTA methods except DMNet. However, DMNet costs 290 ms to process one image, which is much slower than the proposed method. In addition, it can be seen that the performance of these detectors on UAVDT [46] is much worse than that on our dataset, mainly because of the imbalanced data.

V. CONCLUSION
Vehicle detection in aerial images remains a challenging problem due to the IoU distribution imbalance problem and the difficulty of learning feature representations of vehicles at different scales. Therefore, in this paper, we propose a robust vehicle detection model for aerial images. First, we perform image preprocessing to deal with the IoU distribution imbalance problem and greatly improve the recall rate. Then, we propose SSP-SSD to enhance the feature representation of vehicles with different scales and improve the precision. The experiments show that SSP-SSD is able to learn effective features for multi-scale vehicles. To reduce the detection time, we combine the original image and its patches into a batch and utilize parallel computation to accelerate the detection process. However, it still costs more time than detecting only on the original image, mainly because the computation burden increases when dealing with three images. Therefore, a more efficient backbone could be designed to improve the time efficiency. Finally, we propose OA-NMS to fuse results and remove false detections with extreme scales or aspect ratios. Although OA-NMS can remove many false detections, it may also remove some true positives if they have low scores and extreme scales or aspect ratios. In the future, we will focus our research on few-shot learning to improve the performance of our detector on imbalanced data.