VAID: An Aerial Image Dataset for Vehicle Detection and Classification

The availability of commercial UAVs and low-cost imaging devices has made airborne imagery popular and widely available. Aerial images are now extensively used in many applications, especially in the area of intelligent transportation systems. In this work, we present a new aerial image dataset, VAID (Vehicle Aerial Imaging from Drone), for the development and evaluation of vehicle detection algorithms. It contains about 6000 images captured under different traffic conditions, annotated with 7 common vehicle categories for network training and testing. We compare the vehicle detection results obtained using current state-of-the-art network architectures and various aerial image datasets. The experiments demonstrate that training the networks on our VAID dataset provides the best vehicle detection results. Our aerial image dataset is publicly available at http://vision.ee.ccu.edu.tw/aerialimage/ and the code is available at https://github.com/KaiChun-RVL/VAID_dataset.


I. INTRODUCTION
Nowadays, the availability of low-cost image acquisition systems and easy-to-use unmanned aerial vehicles (UAVs) has made aerial imaging more convenient and popular. It is now possible to acquire a large number of high-quality aerial images without elaborate planning or a considerable amount of time. Aerial images have been adopted for many decades in tasks such as cartography, precision agriculture, landscape archaeology and urban studies. One specific application is to detect and classify the vehicles in aerial images. This is gradually being adopted in intelligent transportation for vehicle identification, traffic flow estimation, parking space allocation, etc. Thus, the use of aerial images for transportation and vehicle related applications is a future trend.
Aerial images can cover a variety of scenes from the sky, including forests, rivers, buildings, bridges and roads. In remote sensing applications, various kinds of satellite imagery are used in geography, land surveying and many earth science disciplines. They are also frequently used for the detection of man-made structures, both static constructions and movable targets such as vehicles and vessels. Owing to recent progress in machine learning techniques, high object detection rates can now be achieved in cluttered scenes. The detection and classification of vehicles in aerial images have become more feasible with deep neural networks.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.
The techniques for vehicle detection using aerial images can be classified into two categories: conventional machine learning methods and deep learning approaches [1]. For the machine learning methods, low-level image features such as edge, corner, shape, texture and color are extracted for training and classification. Shao et al. propose a vehicle detection framework which uses local binary patterns combined with histograms of oriented gradient [2]. Differences in color are used for detection, with blob-like areas extracted from prominent color and grayscale features [3]. There also exist traditional computer vision techniques which use frame difference [4] and optical flow [5] for moving vehicle detection.
VOLUME 8, 2020
For the deep learning approaches, convolutional neural networks (CNNs) have brought significant improvement on object [...] [16]. Compared to general object detection tasks, vehicle detection in aerial images raises the following additional issues.
• The target size is usually much smaller.
• The targets tend to have monotonic appearance.
• The images are easily affected by illumination changes.
• There might be a large number of vehicles in an image.
• The target aspect ratio could be large.
In this paper, we introduce a new aerial image dataset, VAID (Vehicle Aerial Imaging from Drone), for vehicle detection and classification. Extended from the previous work using modified Faster R-CNN [17], we compare the advantages, disadvantages and results of vehicle detection in aerial images with several well-known network architectures. Figure 1 shows the system flowchart of the proposed framework for the evaluation of vehicle detection algorithms. It consists of creating our VAID image dataset, and training and testing on the aerial images using various network structures for comparison.

II. RELATED WORKS
Due to its applications in traffic control, parking management and security, the detection of vehicles in aerial images has been studied for many decades [18]- [20]. Compared with vehicle detection from close range or ground viewpoints, the technical requirements are very different, since the targets are much smaller and contain fewer features to distinguish them from the environment [21], [22]. The image quality is also generally degraded by long-range acquisition through the atmosphere. Remote sensing is one of the earliest research fields to adopt the image-based approach for detecting and recognizing objects from the air [1]. Many techniques have been developed for a variety of applications, and are not restricted to the detection of ground objects. The computer vision community then followed by investigating object detection, and specifically vehicle detection, algorithms for airborne images.
Prior to the popularity and success of deep neural networks for object detection and recognition, conventional machine learning methods relied heavily on hand-crafted feature extraction for image classification. When applied to vehicle detection from aerial images, commonly used features include shape, color, corner, texture, disparity, as well as histogram of oriented gradient (HOG) and scale-invariant feature transform (SIFT). They are combined with various classifiers such as support vector machine (SVM), random forest (RF), AdaBoost, and bag-of-words (BoW) for detection and recognition [23]- [25]. Although more recent works on aerial image analysis have gradually moved to deep learning based approaches, conventional methods are still being proposed because of their low complexity and computational cost. Nevertheless, these techniques are designed for specific uses rather than general purposes.
Among the few noticeable improvements for traditional methods, Chen et al. present a fast classification algorithm using a set of sparse representation dictionaries [26]. A multiorder descriptor is proposed to extract the vehicle features in aerial images. By introducing superpixel segmentation and patch orientation, their results on high-resolution images are superior to those obtained from the commonly used HOG+SVM, LBP+PLS (Local Binary Patterns and Partial Least Squares), and sparse representation methods. Xu et al. propose an enhanced Viola-Jones detector for vehicle identification from aerial imagery [27]. A road orientation adjustment stage is adopted to improve the original isotropic detection results. The method is further applied to improve the accuracy of vehicle tracking. Liu et al. also start the design of a vehicle detector from the orientation issue [28]. They develop a fast oriented region search algorithm to detect the position, size, and orientation of an object. A modified vector of locally aggregated descriptors is used to represent an object and distinguish the proposals from the background. The experiments carried out on the public datasets VEDAI and Munich 3K have shown some significant results compared to the existing approaches. For training data collection and labeling, Cao et al. propose an efficient and labor-light scheme which only works on region-level group annotation [29]. A weakly supervised, multi-instance learning algorithm is developed to learn the weak labels. A multi-instance SVM is then trained to classify from the density map derived from the positive regions.
TABLE 1. A summary of the aerial image datasets currently available and used for our evaluation and comparison of vehicle detection and classification algorithms. It shows the number of images in the dataset, the image resolution, the actual scale of a pixel, and the typical size of a vehicle in pixels. Some images from the datasets are shown in Figure 2.
To deal with scale and orientation variations, shadow, and partial occlusion, Cao et al. present an affine-function transformation-based object matching framework [30]. Similar to the previous approach, superpixel segmentation is adopted to generate non-redundant patches, followed by detection and localization with a threshold matching cost. Their results on two UAV image datasets demonstrate that performance comparable to Faster R-CNN can be achieved.
With the recent success of convolutional neural networks for object recognition, they have also been applied to aerial images for vehicle detection. Since the target size is one major issue for aerial imagery, the algorithms often need to emphasize the capability of small object detection. In [31], Zhong et al. propose a method which cascades two convolutional neural networks to improve the detection accuracy without decreasing the speed. The first network is used to generate a set of vehicle-like regions, followed by the second network for feature extraction and decision making. They adopt multi-feature maps with different hierarchies and scales, and achieve high recall rates and low computation costs on two public aerial image datasets. Mandal et al. propose a single-stage detector, AVDNet, specifically designed for small-size vehicle detection in aerial images [32]. The feature vanishing problem for small objects is mitigated by the use of residual blocks at multiple scales. Their algorithms are evaluated on four datasets, and better performance than well-known frameworks such as YOLOv3, Faster R-CNN and RetinaNet is reported. For applications which require in situ real-time processing, He et al. present a compressed MobileNet capable of a 110 fps processing speed [33]. It is built on the lightweight network MobileNet and considers the tradeoff between accuracy and computation. Their algorithm is also implemented on a mobile phone with an acceptable 15 fps inference speed. With a similar objective of reducing the hardware requirement, Ringwald et al. evaluate several popular detection frameworks for the best accuracy/speed trade-off [34]. They build upon SSD to construct a network, UAV-Net, for aerial imagery. The impressive 0.4 MB model size makes it suitable for real-time operations on an embedded platform such as Jetson TX2.

III. VAID AND AERIAL IMAGES DATASETS
Currently, there are not many public datasets available for vehicle detection in aerial images. Some datasets, such as the VIRAT video dataset, are designed for video surveillance and action recognition [35]. The existing aerial image datasets also have some problems, such as containing only a very limited number of categories, imprecise bounding boxes, small image sizes, etc. Several popular datasets for vehicle detection in aerial images include VEDAI, COWC, DLR-MVDA, DOTA and KIT-AIS. The description of these datasets is shown in Table 1. The VEDAI (Vehicle Detection in Aerial Imagery) dataset is made available by Razakarivony and Jurie [36], and originated from the public Utah AGRC database. It contains a total of 1,250 RGB and NIR images with the resolution of 512 × 512 and 1024 × 1024 captured at about the same height. The dataset is manually annotated with 9 classes of objects ('plane', 'boat', 'camping car', 'car', 'pick-up truck', 'tractor', 'truck', 'van', and others) and a total of 2,950 samples. Each image contains 5 vehicles on average, and the vehicle size is about 0.7% of an image. The annotation of each sample includes the sample class, the center point coordinates, the direction and the four corner point coordinates of the ground-truth. The targets in VEDAI are relatively easy to identify. Most of the vehicles in the images are sparsely distributed with simple backgrounds, and the vehicles in densely distributed places such as parking lots are excluded.
The COWC (Cars Overhead With Context) dataset created at LLNL contains overhead imagery collected from six major cities [37]. All images are standardized to 15 cm per pixel at ground level, so the vehicles span about 24 to 48 pixels. The objective of this dataset is mainly vehicle counting, so the annotation differs from that of datasets for vehicle detection and classification. The labeled images in the COWC dataset only mark the center point of a vehicle with a red dot. They do not provide the category or bounding box information. There are 32,716 annotated vehicles in the dataset in total, with an additional 58,247 negative samples. In the DOTA (Dataset for Object deTection in Aerial images) dataset, 2,806 aerial images from different sensors and platforms are collected at the resolution of 4000 × 4000 [38]. It contains more than 188k instances with different scales, orientations and shapes, labeled by quadrilaterals instead of the commonly used bounding boxes. Although the dataset is large in terms of the number of images and instances per image, it is intended for general-purpose use, with only two vehicle classes (large and small) out of the total 15 categories. This makes it unsuitable for object detection in vehicle specific applications.
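The relation between the COWC ground resolution and the quoted vehicle span can be checked with a short calculation. The following is a minimal sketch; the function names are illustrative and only the 15 cm per pixel figure and the 24 to 48 pixel range come from the text.

```python
# Ground sampling distance of COWC as stated in the text (meters per pixel).
GSD_M_PER_PX = 0.15


def pixels_to_meters(length_px, gsd=GSD_M_PER_PX):
    """Convert an image-space length in pixels to a ground length in meters."""
    return length_px * gsd


def meters_to_pixels(length_m, gsd=GSD_M_PER_PX):
    """Convert a ground length in meters to an image-space length in pixels."""
    return length_m / gsd


# The 24 to 48 pixel span quoted for COWC corresponds to roughly 3.6 m to 7.2 m.
for span_px in (24, 48):
    print(span_px, "px ->", round(pixels_to_meters(span_px), 2), "m")
```

Under this reading, the quoted pixel range covers ground lengths from a compact car up to a small truck.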
The DLR-MVDA dataset contains 20 large-scale aerial images [39]. The images are captured with more realistic road scenes, and vehicle detection is more challenging. KIT-AIS is a dataset with images taken from an airplane at about 330 m above the ground [40]. It has 228 high-resolution images (5161×3744), but there is only one annotated vehicle category for network training.
This paper introduces a new vehicle detection dataset, VAID (Vehicle Aerial Imaging from Drone), with the aerial images captured by a drone. We collect about 6,000 aerial images under different illumination conditions and viewing angles from different places in Taiwan. The images are taken with the resolution of 1137 × 640 pixels in JPG format. Our VAID dataset contains seven classes of vehicles, namely 'sedan', 'minibus', 'truck', 'pickup truck', 'bus', 'cement truck' and 'trailer'. Figure 2 shows some example images from our VAID dataset as well as four other datasets, VEDAI, DLR-MVDA, KIT-AIS and COWC. It can be seen that the vehicles are much smaller than the objects in general recognition and classification datasets.
Although the vehicles are divided into the seven categories according to their popularity in Taiwan's road scenes, annotation is sometimes very tricky. The characteristics of small sedans viewed from above are less obvious, and the types are more diverse, including two-door and four-door sedans, five-door hatchbacks, recreational vehicles and nine-seat vans. There are a few differences in the definition of a truck and a pickup truck for annotation. A truck is defined as a vehicle with a shelter in the cargo area or a vehicle with its own cargo area as a container, and the body and the front of the vehicle are completely disconnected. A pickup truck, however, is not covered by a canopy. A minibus is a 21-seat medium size bus, while a bus includes passenger and big buses. The trailer category includes tank trucks, gravel trucks, tow trucks, and container trucks with detachable tailgates. The images in the dataset are annotated using the labeling tool LabelImg in the format of PASCAL VOC, including the names of the classes and the bounding box coordinates. Figure 3 shows several cropped vehicle images from different categories.
VAID Dataset: http://vision.ee.ccu.edu.tw/aerialimage/
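Annotations in the PASCAL VOC format are XML files with one `object` element per labeled target. The following is a minimal sketch of reading such a file with the Python standard library. The sample XML, its file name and its coordinates are made up for illustration, and the class strings stored in the released VAID files may differ from the category names used in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout (not taken from the dataset).
SAMPLE_XML = """<annotation>
  <filename>000001.jpg</filename>
  <size><width>1137</width><height>640</height><depth>3</depth></size>
  <object>
    <name>sedan</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>140</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>bus</name>
    <bndbox><xmin>300</xmin><ymin>50</ymin><xmax>390</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((name, coords))
    return boxes


for class_name, box in parse_voc(SAMPLE_XML):
    print(class_name, box)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring` with the rest unchanged.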
The images in the dataset are taken by a drone (DJI's Mavic Pro). To keep the sizes of the vehicles consistent in all images, the altitude of the drone is maintained at about 90-95 meters above the ground during video recording. The output resolution is 2720 × 1530 at 2.7K and the frame rate is about 23.98 fps. For an average sedan with a length of 5 meters and a width of 2.6 meters, the apparent size in the image is about 110 × 45 pixels. In the VAID dataset, the images are scaled to the resolution of 1137×640, and a sedan in the images is about 40 × 20 pixels in size.
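The effect of downscaling on the apparent vehicle size can be sketched as follows. Only the two resolutions and the 110 × 45 pixel sedan size come from the text; the function name is illustrative, and the result is an approximation of the same order as the "about 40 × 20" figure quoted above.

```python
# Recording and dataset resolutions stated in the text (width, height).
SRC_W, SRC_H = 2720, 1530
DST_W, DST_H = 1137, 640


def scale_box(w_px, h_px):
    """Scale a box size from the recording resolution to the dataset resolution."""
    sx = DST_W / SRC_W  # horizontal scale factor, about 0.418
    sy = DST_H / SRC_H  # vertical scale factor, about 0.418
    return w_px * sx, h_px * sy


sedan_w, sedan_h = scale_box(110, 45)
print(round(sedan_w), round(sedan_h))  # prints: 46 19
```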
The dataset covers ten geographic locations in southern Taiwan, and contains various traffic and road conditions. The images are taken on sunny days with sufficient light, in the afternoon with interference from the shadows of houses, and in the evening under darker imaging conditions. Figure 4 shows some of the dataset images with various road and traffic scenes. There are 7 categories in total for vehicle classification in our VAID dataset. The images are divided into 3 regions, namely urban area, suburb and university campus. Some statistics are shown in Table 2. Another important statistic, the distribution of the objects' aspect ratios, is shown in Figure 5.

IV. EXPERIMENTS AND EVALUATION
To evaluate the effectiveness of the proposed VAID dataset, two experiments are carried out with different object detection techniques and several aerial image datasets. First, our VAID dataset is used to train five popular object detection architectures, namely Faster R-CNN, YOLOv4, MobileNetv3, RefineDet and U-Net, for performance comparison. The network architecture with the best performance for vehicle detection and classification in this experiment is considered for further evaluation. Second, the selected network structure is trained separately using different aerial image datasets, including VEDAI, DLR-MVDA, COWC, KIT-AIS and VAID. The trained neural network models are then tested on a new dataset for performance evaluation. This provides a comparison of the effectiveness of the training sets. The hardware used for the evaluation is a PC with an Intel i7-8700k CPU, 16GB RAM and Nvidia GTX1080Ti GPU. The software tools for the development include [...]
Our VAID dataset contains 5,985 aerial images in total, with the vehicles classified into seven categories. It is split into three parts, with 1,512 images for training, 1,534 images for validation, and 2,939 images for testing. Table 3 shows the detailed information for each class in the training, validation, and testing sets. It can be seen that the number of training samples is unbalanced among the classes. Thus, training the network with fewer samples is an important issue for achieving better classification results.
We use our modified Faster R-CNN model as the baseline for benchmarking. First, the ReLU (Rectified Linear Unit) activation function is used on the RPN (Region Proposal Network) layer. As shown in Table 4, this provides slightly better results compared to the original network and the modifications with other activation functions. Second, we replace the feature extraction model with ResNet-101. Finally, the anchor aspect ratios are changed from [0.5, 1, 2] to [0.2, 0.5, 1, 1.2, 2].
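To illustrate what the change of aspect ratios means for the anchor boxes, the following minimal sketch lists the anchor shapes produced by each ratio set. It assumes the common convention that a ratio is height divided by width and that all anchors at one scale share the same area; the base size of 32 pixels is a hypothetical value, not a setting reported in the text.

```python
import math

ORIGINAL_RATIOS = [0.5, 1, 2]
MODIFIED_RATIOS = [0.2, 0.5, 1, 1.2, 2]
BASE_SIZE = 32  # hypothetical anchor side length in pixels for ratio 1


def anchor_shapes(base_size, ratios):
    """Return (width, height) pairs with area base_size**2 and height/width == ratio."""
    shapes = []
    for ratio in ratios:
        width = base_size / math.sqrt(ratio)
        height = base_size * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


for label, ratios in (("original", ORIGINAL_RATIOS), ("modified", MODIFIED_RATIOS)):
    print(label)
    for ratio, (w, h) in zip(ratios, anchor_shapes(BASE_SIZE, ratios)):
        print(f"  ratio {ratio}: {w:.1f} x {h:.1f}")
```

Under these assumptions, the added ratio of 0.2 yields a strongly elongated anchor, which matches the observation in the introduction that the target aspect ratio could be large.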
Our modified Faster R-CNN architecture is illustrated in Figure 6. For the evaluation of other network models (YOLOv4, MobileNetv3, RefineDet, U-Net), we use the default settings without further changes.
The network model evaluation on the VAID dataset is tabulated in Table 5. It shows the mAP (mean average precision), precision, recall and F1 score for Modified Faster R-CNN, YOLOv4, MobileNetv3, RefineDet and U-Net. Figure 7 shows the vehicle detection results of a parking lot image using different network models. There are several important observations from the network outputs and evaluation results. First, Modified Faster R-CNN has 90.12% mAP but very low precision. This is due to a large number of incorrect predictions of the 300 anchor boxes in the network model. Second, U-Net reports very high precision but only a relatively low mAP (85.38%). This is caused by the use of pixel-level segmentation to define the bounding box for U-Net, which reduces the number of false detections. However, if the objects are very close to each other, they tend to be considered as a single large target, as shown in Figure 7(e). Third, MobileNetv3 has the lowest mAP among all network models. As indicated in Figure 7(d), it cannot deal with nearby objects very well. The main problem is the feature map extraction. For other models, including YOLOv4, RefineDet and U-Net, the next higher dimension feature map is used to regenerate the feature map. However, the use of the raw feature map makes it hard for MobileNetv3 to distinguish the object features and to perform bounding box regression well. Finally, YOLOv4 provides the best performance in terms of mAP, precision and F1 score (with the recall slightly worse than Modified Faster R-CNN), and is selected for the experiments on the dataset evaluation. In general, all network models perform fairly well for vehicle detection. However, if the viewing angle of the camera with respect to the ground is too large, none of the models can provide good results.
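The relation between precision, recall and F1 score discussed above can be made concrete with a small calculation. The following is a minimal sketch using the standard definitions; the detection counts are hypothetical and are not taken from Table 5.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall and F1 score from detection counts.

    tp: true positives, fp: false positives, fn: false negatives (missed targets).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: many false positives give high recall but low precision.
p, r, f1 = detection_metrics(tp=90, fp=210, fn=10)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

A detector that emits many incorrect boxes can therefore keep a high recall while its precision and F1 score drop, which is the pattern described for Modified Faster R-CNN.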
In the second experiment, we evaluate the aerial image datasets DLR-MVDA, VEDAI, COWC, KIT-AIS and VAID using YOLOv4 for vehicle detection. The network is trained using the individual image datasets separately and tested on a new dataset (with the aerial images acquired from different places) for performance evaluation. Because COWC and KIT-AIS provide only one category ('vehicle'), we modify the labels of all datasets to a single vehicle class as a basis for comparison. If an image is larger than 1137 × 640, it is cut into several 1137 × 640 sub-images for processing. Some classes in VEDAI which are not vehicle related, such as 'boat', 'plane' and 'other', are removed from the dataset. The annotation in COWC only provides a dot at the center of a target, so we set a 20 × 20 bounding box on each object for IoU (Intersection over Union) computation.
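The conversion of a COWC center dot into a fixed-size box and the IoU computation can be sketched as follows. This is a minimal illustration under the assumption of axis-aligned boxes given as (xmin, ymin, xmax, ymax); the function names and the example coordinates are hypothetical.

```python
def box_from_center(cx, cy, size=20):
    """Build a size x size axis-aligned box centered on a dot annotation."""
    half = size / 2
    return (cx - half, cy - half, cx + half, cy + half)


def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter_w = max(0.0, ix2 - ix1)
    inter_h = max(0.0, iy2 - iy1)
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a dot at (100, 100) and a prediction shifted by 2 pixels.
ground_truth = box_from_center(100, 100)
prediction = (92, 90, 112, 110)
print(round(iou(ground_truth, prediction), 3))
```

A detection is then counted as correct when its IoU with a ground-truth box reaches the chosen threshold.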
The new testing data for the evaluation of different network models are selected from four other image acquisition scenarios. Figure 8 shows some example images in the testing dataset. Scene A consists of the aerial images acquired from two different locations in a city (see Figure 8(a)). Scenes B, C, D are the airborne traffic scene videos obtained from YouTube, which are recorded above two highways and one expressway in Taiwan, and a crossroad in Belarus. As shown in Figure 8(b), the highway images in Scene B contain several roads at different altitudes, and the objects may have different scales even if they belong to the same category. In Scene C, as illustrated in Figure 8(c), there are some vehicles parked on the roads with different orientations. The images in Scene D consist of the road scenes acquired in Belarus, with the vehicle size larger than those in the training dataset (see Figure 8(d)).
Table 6 shows the evaluation results of different scenes (A, B, C and D) using VAID, VEDAI, DLR-MVDA, COWC and KIT-AIS as training datasets. The details and specifications of the training and testing data are also provided. While the IoU threshold for VAID, VEDAI and KIT-AIS is 0.5, it is set to 0.25 for DLR-MVDA and COWC. This is due to the imprecise ground-truth bounding boxes of DLR-MVDA (too small) and COWC (too large); the mAPs would be close to 0 if an IoU of 0.5 were used. Figure 9 shows some example images of the detection results using different training datasets. Scenes A, B and C contain the road images acquired in Taiwan, and vehicles such as trucks and trailers are rare in other datasets. This causes classification problems for certain types of vehicles, and results in low mAP for VEDAI, DLR-MVDA and COWC. Using our VAID dataset for network training, high accuracy results are obtained for Scenes A, C and D. Our low mAP result for Scene B is mainly due to the much smaller vehicle size (about 20 × 10) compared to those in VAID (about 40×20) used for training.
In Scene B, the vehicles in the images are at different elevations (on the viaducts). Our dataset images are collected at approximately the same height, while other datasets including KIT-AIS, DLR-MVDA, and VEDAI contain images taken at different heights. Moreover, the KIT-AIS images are not only acquired from multiple heights, but are also similar to Scene B, as illustrated in Figure 2(d). Consequently, the networks trained using our dataset do not perform as well as those trained using VEDAI, DLR-MVDA and KIT-AIS in Scene B. Nevertheless, the network trained on our dataset provides much better overall accuracy.

V. CONCLUSION
In this paper, we present a new aerial image dataset for the development and evaluation of vehicle detection algorithms. The dataset contains about 6,000 images captured under different illumination conditions, and is available for public access. To illustrate the effectiveness of our dataset, the performance evaluation of vehicle detection techniques is carried out on widely used network architectures and training datasets. The experimental results demonstrate that training the deep neural networks using our VAID dataset provides the best vehicle detection rate on an independent testing dataset. In the future, the aerial image dataset will be extended with diverse imaging conditions and maintained for public access and benchmarking.