Smart Traffic Monitoring Through Pyramid Pooling Vehicle Detection and Filter-Based Tracking on Aerial Images

Increased traffic density, combined with global population growth, has resulted in increasingly congested roads, more air pollution, and more accidents. Globally, the overall number of automobiles has expanded dramatically during the last decade. Traffic monitoring in this environment is a significant challenge in many developing countries. This work introduces a novel vehicle detection and classification system for smart traffic monitoring that uses a convolutional neural network (CNN) to segment aerial imagery. The segmented images are then examined to detect vehicles by incorporating a novel customized pyramid pooling module. The detected vehicles are classified into various subcategories. Finally, the vehicles are tracked via Kalman filter (KF) and kernelized filter-based techniques to cope with and manage massive traffic flows with minimal human intervention. In the experimental evaluation, the proposed system achieved vehicle detection rates of 95.78% on the Vehicle Aerial Imagery from a Drone (VAID) dataset, 95.18% on the Vehicle Detection in Aerial Imagery (VEDAI) dataset, and 93.13% on the German Aerospace Center (DLR) DLR3K dataset. The proposed system has a variety of applications, including identifying vehicles in traffic, sensing traffic congestion on a road, estimating traffic density at intersections, detecting various types of vehicles, and providing a path for pedestrians.


I. INTRODUCTION
Technological advances in remote sensing have increased its popularity and made it more widely available. Recently, several researchers have devoted their efforts to segmentation [1], object recognition [2], [3], [4], scene classification [5], [6], [7], [8], vehicle detection [9], and traffic control systems [10], [11], [12] via aerial and remote sensing (RS) imagery. The list of abbreviations is provided in Table 1. RS and aerial data could significantly boost the efficiency and effectiveness of traffic control and management. Traffic management applications include sensing traffic congestion, classifying the different types of vehicles, identifying suspicious vehicles in traffic, and vehicle parking, which makes vehicle detection a prominent and essential problem in aerial imagery. Although vehicle detection has been studied on close-range image data, aerial imagery gives significant information about environments and traffic objects.
The use of a traffic monitoring system is a viable option for reducing traffic jams. The primary function of the traffic monitoring system is to maintain traffic data, such as the number of cars, the kinds of vehicles, and the speed at which they travel. In order to effectively utilize the road network, estimate future transportation requirements, and enhance traveler safety, it conducts traffic analysis using the acquired data. Traffic monitoring systems are usually expensive to create, deploy, and maintain in most countries.
This article focuses on the problem of vehicle detection for traffic monitoring systems using aerial imagery from drones and closed-circuit television (CCTV) cameras. We propose first segmenting the image, then detecting the vehicles and classifying them into various categories for effective traffic management. Initially, the aerial images are taken as input for semantic segmentation. Then a customized pyramid pooling module (CPPM) is incorporated for vehicle detection in the segmented image. After classification, the detected vehicles are grouped into seven categories. Finally, the classified vehicles are tracked by employing two different tracking mechanisms (Kalman filter-based vehicle tracking and kernelized correlation filter-based vehicle tracking). Furthermore, the presented model is validated through experiments performed over the Vehicle Aerial Imagery from Drone (VAID), Vehicle Detection in Aerial Imagery (VEDAI), and German Aerospace Center (DLR3K) datasets. The experiments demonstrated higher detection and classification accuracy than other state-of-the-art (SOTA) methods.
The most significant contributions of this work are listed as follows:
• We propose a novel hybrid framework to detect, classify, and track vehicles on roads for efficient management of transportation systems in rural and urban areas.
• A novel vehicle detection approach based on a customized pyramid pooling module (CPPM) is devised for robust traffic monitoring.
• Two different filter-based tracking approaches, Kalman filter-based and kernelized filter-based tracking, are implemented for vehicle tracking.
• Compared to existing techniques, we significantly improve the performance metrics, including detection rate, precision, recall, F1 score, and mean average precision, for the classification of vehicles.
• The efficiency of the proposed model is verified over three publicly available datasets in the experimental results, demonstrating outstanding performance.
The remaining part of the paper is organized as follows. The related work is presented in Section II. The proposed methodology and architecture are briefly introduced in Section III, which includes semantic segmentation and vehicle detection using CNN and a CPPM, respectively. Classification of vehicles into seven categories is performed by employing linear discriminant analysis (LDA). Section IV covers the experimental results using aerial and remote sensing data. Section V comprises a discussion of the experiments and results. The conclusion and future work are presented in Section VI.

II. RELATED WORK
Numerous researchers have focused on traffic monitoring systems using machine learning approaches, while others have used deep learning frameworks. Most of the researchers have devoted their efforts to vehicle detection and classification. They incorporated hand-crafted feature techniques including the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), the histogram of oriented gradients (HOG), and Haar-like features. Once these features are extracted, various machine learning classifiers are applied to detect and classify vehicles in the imagery. These methods are computationally complex and expensive because they rely on sliding windows and multi-level search. More recently, deep learning-based methods have performed better than the previous techniques, particularly for vehicle detection in aerial images and scene understanding tasks. By using convolutional neural networks (CNNs), deep learning-based methods provide better feature representations than hand-crafted features and shorter processing times than sliding window-based methods. CNN-based object detectors are mainly divided into two-step and one-step detectors. Two-step detectors, such as R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN, use region proposals to complete object location regression and classification in two steps. In contrast, one-step detectors, such as YOLOv3 and the single-shot multibox detector (SSD), predict object locations and classes simultaneously in a single network. However, CNN-based methods for vehicle detection in aerial images are limited. Specifically, they perform less satisfactorily in the localization of small objects in a large scene. In addition, training these networks generally demands a high computational cost, and the lack of well-annotated training data adds to the challenge.
In this study, we aim to introduce a robust vehicle detection and classification framework that requires limited training data and computational power.

A. LEARNING-BASED VEHICLE DETECTION
For decades, machine learning has been extensively used in computer vision tasks, particularly intelligent traffic management and monitoring. F. Tang et al. [13] presented a model that considers both the value matrix and a spatial-temporal training model while extracting features to predict traffic patterns. They simulated their model and demonstrated a better packet loss rate, average accuracy, and transmission throughput. Liu et al. [14] devised a method to improve the segmentation of the objects and then applied a probabilistic classification model to detect the vehicles correctly. They used aerial images and LiDAR data for this purpose. Tang et al. [15] conducted experiments for vehicle detection on static images by extracting Haar features and then employed an AdaBoost classifier to detect the vehicles in the images. Their approach is practically suitable for various surveillance applications. Ukani et al. [16] introduced a vehicle detection and classification system that analyzes traffic in video. They extracted SIFT features for further processing and used an artificial neural network as well as a support vector machine (SVM) as classifiers. Their experiments showed better performance with the SVM. Huang et al. [17] used a combination of background subtraction and a deep belief network to detect the vehicles in a tunnel. This is a challenging problem because different cameras are used in the tunnel. There are also resolution and illumination problems due to reflection on the walls of the tunnel. In [20], a one-step vehicle detection network (AVDNet) was developed that is effective at identifying small vehicles. In AVDNet, ConvRes residual blocks are added to handle the small object problem through deeper convolutional layers while extracting features. The larger feature map at the output, combined with these residual blocks, ensures that the important features extracted from small-sized objects are well represented by the map.
They also came up with a way to look at the network's behavior through recurrent-feature aware visualization (RFAV).
In [21], Al-qaness et al. presented a new technique for intelligently tracking vehicles based on video surveillance. They combined different models to track the vehicles. Initially, they process the video with a CNN, and then they use YOLOv3 as an object detection model that is capable of locating the position, scale, and category of each object in the image frame. They carried out various experiments to detect objects of different scales, including small, medium, and large-scale objects. Moreover, they used average precision, recall, precision, and intersection over union scores to measure the efficiency of the system. Although their proposed system is capable of detecting vehicles on roads and highways, some challenges still need to be addressed. For instance, objects or vehicles that are more than 50% occluded or overlapped are not correctly detected and tracked. Similarly, nighttime vehicle tracking is a challenge that is not addressed in that study. In [22], Cheng-Jian Lin et al. introduced a three-tier system that is proficient in detecting, counting, and classifying vehicles in different scenarios. They used YOLO for vehicle detection in the first phase. In the second phase, they employed the Kalman filter fused with the Hungarian algorithm to count the vehicles. Finally, a convolutional fuzzy neural network is applied to classify the vehicles into various categories. Their proposed model is effective at increasing the accuracy while decreasing the number of parameters. In [23], Peña Cáceres et al. proposed a model to detect helmet use while riding a motorcycle using the YOLOv4 algorithm. Their model consists of seven phases, from acquiring and processing the data through to completing and deploying the system. They performed various experiments using online platforms.
Moreover, they set the ratio to 60:35:5 for training, validation, and testing, respectively, and achieved a detection accuracy of 88.65%.
This research aims to contribute to modern machine vision technologies. The primary purpose of our system is vehicle detection and traffic monitoring to control massive transport. Further, we aim to improve the performance of our system and obtain better results than existing vehicle detection and traffic monitoring systems. Our goal is to try different deep learning techniques to achieve the best possible vehicle detection accuracy.

III. OUR APPROACH
Initially, the videos containing traffic data are converted to a sequence of frames. These frames then undergo a segmentation process one by one until the last frame is reached. Then, the segmented images are analyzed for vehicle detection by employing the CPPM. The detected vehicles are also classified into seven different vehicle categories. The detected and classified vehicles are tracked through two different approaches: kernelized correlation filter-based vehicle tracking and Kalman filter-based vehicle tracking.
To get better results for vehicle detection, tracking, and traffic monitoring, we converted the video into a sequence of images/frames for further processing. Once the frames are extracted from the traffic video, three different types of noise are examined and the frames are de-noised using various filtering techniques. Only the best-suited filter that incorporates the real-time defogging processing of the aerial images is applied to the respective noise for the best results. The preprocessing step is shown in Fig. 2.

B. CNN-BASED SEMANTIC SEGMENTATION
After the pre-processing phase, image segmentation is performed to separate the vehicles from the other objects and backgrounds. A CNN-based semantic segmentation technique is applied for this purpose. In this phase, a SegNet-based network with two streams is used. For faster information flow, we used residual blocks with skip connections. The residual block contains two convolution layers, namely conv I and conv II. Layer one comprises 128 filters with a size of 1 × 1, while the other layers use 128 filters with a size of 3 × 3. The output produced by the residual block is combined with the output of the second convolutional layer.
In this study, a unique encoder-decoder-based architecture is used. The structure comprises two components: the first component involves five convolution blocks, while the second consists of a rectified linear unit (ReLU) and batch normalization (BN). By incorporating un-pooling layers in the encoder and decoder, we can restore the resolution to its original state. The encoder and decoder are present in both streams, and at the end of the streams, the combined result of both streams is considered for further processing. A residual block with skip connections is also utilized, as described earlier, to send information from each encoder convolution block to its respective encoder-decoder convolution block in both streams. Fig. 3 demonstrates semantic segmentation results over a few examples of the VAID dataset.
To make the networks converge faster, we initialized them with VGG-16 weights pretrained on ImageNet and trained for 50 epochs. The PyTorch framework was used to build the networks. Each convolution block utilizes batch normalization. Network weights are optimized via stochastic gradient descent. The starting learning rates for all decoders and encoders are 0.01 and 0.005, respectively. After 20, 30, and 40 epochs, the learning rate is reduced by a factor of 10.
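The step schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: the function name `step_lr`, the zero-based epoch index, and the convention that the decay applies from the milestone epoch onward are all assumptions.

```python
def step_lr(base_lr, epoch, milestones=(20, 30, 40), gamma=0.1):
    """Learning rate at a given (zero-based) epoch under step decay.

    The rate is multiplied by `gamma` once for every milestone that
    has been reached, i.e. divided by 10 after 20, 30, and 40 epochs.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Starting rates taken from the text: 0.01 (decoders), 0.005 (encoders).
for epoch in (0, 20, 30, 40):
    print(epoch, step_lr(0.01, epoch), step_lr(0.005, epoch))
```

In PyTorch, the same behaviour is usually obtained with a multi-step scheduler and separate parameter groups for the encoder and decoder.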

C. VEHICLE DETECTION VIA CUSTOMIZED PYRAMID POOLING MODULE
Local ambiguity can be alleviated by contextual information, as demonstrated in [24]. On VOC2012 [25] and PASCAL-Context [26], ParseNet [14] successfully combined local features with global pooling to enhance the feature set. However, it falls short of what would be required in a more complex scenario. Based on the successful object recognition technique of spatial pyramid pooling, PSPNet [27] integrated various sub-regions to increase the overall contextual information. Its pyramid pooling module has four sub-region scales, including one global pooling layer. The other, non-overlapping pooling layers comprise bins of variable sizes. The stride and the kernel size are the same for these non-overlapping layers.
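The effect of setting the stride equal to the kernel size can be checked with the usual pooling output-size formula. The sketch below is illustrative only; the helper name `pooled_size` and the example sizes are assumptions, and the formula is the common floor-based convention used by most deep learning frameworks.

```python
def pooled_size(n, kernel, stride, padding=0):
    """Spatial output size of a pooling layer (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1


# Non-overlapping pooling (stride == kernel) divides the size by the kernel.
print(pooled_size(60, kernel=10, stride=10))            # 6

# If the input is not a multiple of the kernel, the remainder is dropped,
# which is where alignment problems after up-sampling can come from.
print(pooled_size(65, kernel=10, stride=10))            # 6

# For contrast, overlapping pooling with stride 1 and padding (k - 1) / 2
# keeps the spatial size unchanged for any input size.
print(pooled_size(65, kernel=3, stride=1, padding=1))   # 65
```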
Non-overlapping pooling divides the spatial size of the feature map by the kernel size. For this module to work, the size of the input feature map (IFM) must be compatible with the kernel size, that is, a multiple of it. Otherwise, alignment issues could arise when the module pools and up-samples. For instance, if the kernel sizes are 40, 20, and 10, then the sum of these kernels is 70, and a multiple of 70 is required for IFM. The CPPM is more effective than the non-overlapping pooling module. The levels and the kernels are variable-sized and treated as hyperparameters. The first layer is responsible for extracting global features by creating a single-bin output, while the local features are extracted by the other three layers (overlapping pooling layers). The IFM remains of fixed size, as the stride and padding of the overlapping pooling layers are kept constant. To reduce the size of the feature map, non-overlapping pooling is performed with a small kernel before applying the CPPM. Either a max or an average pooling operation may be used. An up-sampling operation with bilinear interpolation is performed to make the feature maps compatible. Then, all three of these features are fused. The CPPM works consistently with an IFM of any size because it uses a stride of 1 for customized pooling. Fig. 4 and Fig. 5 illustrate the vehicle detection results over some images from the VAID and DLR3K datasets, respectively.
Linear discriminant analysis (LDA) [28] is a variant of the Bayesian model. It is a supervised technique and therefore uses class labels for training. LDA tries to keep intra-class variations low and inter-class variations high. It is employed to classify the detected vehicles into various classes. LDA does not require feature scaling, since it finds its coefficients based on the differences between the classes. Figs. 6 and 7 show the classification results over the VEDAI and VAID datasets, respectively, where each class is separated, and a total of nine classes are grouped by using the equation as follows:
where the mean of class $i$, over all $C$ classes, is denoted by $\mu_i$, $\Sigma$ represents the covariance, and $\mu$ denotes the mean of the class means.
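Assuming the standard multi-class LDA formulation, the quantities described here are consistent with the between-class covariance of the class means. Under that assumption (the symbol $\Sigma$ and the uniform $1/C$ weighting are illustrative choices, not confirmed by the text), it can be written as:

```latex
\Sigma = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^{\top},
\qquad
\mu = \frac{1}{C} \sum_{i=1}^{C} \mu_i
```

LDA then looks for projection directions along which this between-class spread is large relative to the within-class spread, which matches the stated aim of keeping intra-class variation low and inter-class variation high.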

E. VEHICLE TRACKING VIA KALMAN FILTER METHOD (KER_FILTER)
Kalman filter-based vehicle tracking [29] and its variants [6], [12] are commonly used methods in computer vision tasks. Mathematically, the underlying model can be described as follows:

$$X_{t+1} = A_t X_t + \omega_t, \qquad Y_t = C_t X_t + v_t$$

where $X_t \in \mathbb{R}^n$ represents the state vector, $Y_t \in \mathbb{R}^m$ is the output vector, and $\omega_t \in \mathbb{R}^n$ and $v_t \in \mathbb{R}^m$ are the process noise and the measurement noise at step $t$, respectively. The process matrix $A_t \in \mathbb{R}^{n \times n}$ and the output matrix $C_t \in \mathbb{R}^{m \times n}$ are matrices with the required dimensions.
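To make the state-space description above concrete, the following is a minimal scalar sketch of one Kalman predict/update cycle. It is an illustration under simplifying assumptions, not the tracker used in this work: the state is one-dimensional, and the names `a`, `c`, `q`, and `r` (process coefficient, output coefficient, process-noise variance, measurement-noise variance) and their default values are hypothetical.

```python
def kalman_step(x_est, p_est, y, a=1.0, c=1.0, q=1e-3, r=0.1):
    """One predict/update cycle of a scalar Kalman filter.

    x_est, p_est: previous posterior state estimate and its variance.
    y: new measurement. Returns the new posterior estimate and variance.
    """
    # Predict: propagate the estimate and its uncertainty (prior).
    x_prior = a * x_est
    p_prior = a * p_est * a + q

    # Update: blend the prior with the measurement via the Kalman gain.
    gain = p_prior * c / (c * p_prior * c + r)
    x_post = x_prior + gain * (y - c * x_prior)
    p_post = (1.0 - gain * c) * p_prior
    return x_post, p_post


# Repeated measurements of a constant position pull the estimate towards it.
x, p = 0.0, 1.0
for measurement in [5.0] * 50:
    x, p = kalman_step(x, p, measurement)
print(round(x, 3), p < 1.0)
```

For vehicle tracking, the same recursion is applied with a vector state (for example, bounding-box position and velocity) and matrix-valued `A_t` and `C_t`.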
The Kalman filter also uses probabilities in terms of the prior and posterior probability, which can be expressed mathematically as follows:
$\bar{X}_t$
In more traditional centralized approaches, the local data collected by each node is relayed to a central server for global estimation. Using KF, all nodes communicate with each other in a decentralized manner. The computation process is heavy and takes a long time. To reduce the computation time, alternative methods such as the distributed Kalman filter (DKF) and diffusion least-mean-square (DLMS) are used, owing to the efficiency of their information processing mechanism. In DKF, there is no need for a central layer, as every node is capable of estimating the system's state. Fig. 8 illustrates the results of vehicle detection by incorporating the KF tracking.
Usually, to identify the target vehicle in the frame, a bounding box is drawn around the vehicle. In the correlation filter tracking method [30], highly sampled and circularly shifted image patches are synthesized to build a circulant data matrix. This method increases the number of training samples without compromising accuracy. The location of the maximum correlation response also aids detection in the successive frames, making the target easier to recognize. Let $x \in \mathbb{R}^{P \times Q \times C}$, where $P \times Q$ denotes the size of the patch with $C$ channels taken from the sample image. All the circulant images $M_{(p,q)}$ with $p < P$, $q < Q$ are combined to produce the circulant matrix $M$. Hence, the discrete Fourier transform (FT) is used to compute the eigenvectors of the circulant matrix $M$:
where the Hermitian transpose of $F$ is denoted by the matrix $F^H$, $\mathrm{Diagonal}(\cdot)$ denotes the diagonal matrix built from the corresponding vector, and $\hat{x}$ denotes the FT of $x$. The correlation filter $w$ and bias $b$ are used in the equations:
Here, all the variants of the original image, such as the patch $M_{(p,q)}$, are part of the circulant matrix $M = [M_{(0,0)}; M_{(0,1)}; \ldots; M_{(P-1,Q-1)}]$.
Each new sample in $M$ is assigned a unique class label, and these class labels are expressed as $y = [y(0,0), y(0,1), \ldots, y(M-1, N-1)]^{\top}$, while $\mathcal{F}^{-1}(\cdot)$ represents the inverse discrete FT. The class labels $y$ are assigned using the distance between the central positions, $\|r^* - r_{m,n}\|$, of the region of interest and of the image after the circular shift, $x_{(m,n)}$.
where the minimum and maximum of the range of values are represented by $l_o$ and $u_o$, and the scale and shape parameters are denoted by $sc$ and $sh$, respectively. The kernel is represented as follows:
To define $\psi(x)$, which is a non-linear feature mapping, a kernel function $K$ is used. To define the kernelized correlation filtering process, eq. (10) can be written by incorporating the properties of the circulant matrix $K$, with $\|w\|^2 = \alpha^{\top} K \alpha$. Given $\xi = e + 1 - y \odot \left(\mathcal{F}^{-1}(\hat{x}^* \odot \hat{w}) + b\mathbf{1}\right)$, the linear constraint is represented by $e$, and the autocorrelation among the kernels may be computed by $\hat{k}^{xx}$, e.g. $k^{xy} = \exp\left(-\left(\|x\|^2 + \|y\|^2 - 2\mathcal{F}^{-1}(\hat{y}^* \odot \hat{x})\right)\right)$ (an RBF kernel).
In this work, before the fusion of kernels, a unique Gaussian kernel is produced with the help of various features to preserve the filtering responses. If we have an $l$-th type of feature vector $x^{(l)}$ of size $M \times N \times D$, then the training examples of that specific feature vector along its dimensions are computed by the circular shift operation. The estimated response map may be expressed mathematically as follows:
where the optimal coefficient vector is represented by $\hat{\alpha}$, and $z$ is used to compute the required position for the $l$-th feature vector. Multikernel correlation responses are integrated into a final distribution map that is dynamically combined using different kernel filters, as shown in Fig. 9.
Scaling parameters can be estimated using variable-scale pyramids, which are able to adjust to variations in appearance. More than one sample is taken from the present target location, and these samples are called "scale-pool samples" ($S = \{s_1, s_2, s_3, \ldots, s_v\}$). As soon as a new frame becomes available, the highest of the $v$ correlation responses can be used to identify both the target's position and its scale at the same time. Normally, we expect the optimal response map to have a sharp peak, but a further decline may cause the response map to be significantly transformed. It is effective to determine the optimal learning rates for the $l$ different sorts of feature kernels based on the highest points of the respective response maps. We can define the maximum and minimum response as $R^{(l)}_{\max}$ and $R^{(l)}_{\min}$, respectively, while $\sigma^{(l)}$ denotes the standard deviation. The coefficients $\hat{\alpha}^{(l)}$ are updated as follows:
where the fusion parameter is called $\eta$. Although the original template shape can be preserved to some extent, repetitive pattern filters can also be derived using this method.
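The role of circular shifts and of the maximum correlation response, as described in this section, can be illustrated with a deliberately simplified one-dimensional sketch. It is a brute-force, linear (non-kernelized) toy and not the tracker used here: practical correlation filters evaluate the same responses in the Fourier domain, and the function names and sample values below are hypothetical.

```python
def circular_shifts(x):
    """All cyclic shifts of x; row k is x shifted right by k positions."""
    n = len(x)
    return [x[n - k:] + x[:n - k] for k in range(n)]


def correlation_response(template, patch):
    """Dot product of the patch with every cyclic shift of the template."""
    return [sum(a * b for a, b in zip(row, patch))
            for row in circular_shifts(template)]


def locate(template, patch):
    """Shift with the maximum response, i.e. the estimated displacement."""
    response = correlation_response(template, patch)
    return max(range(len(response)), key=response.__getitem__)


template = [0, 1, 3, 1, 0, 0, 0, 0]
patch = [0, 0, 0, 0, 1, 3, 1, 0]   # the template moved right by 3 samples
print(locate(template, patch))      # 3
```

The rows returned by `circular_shifts` form the circulant data matrix; because such a matrix is diagonalized by the discrete Fourier transform, the full response can be computed with a few element-wise products instead of the explicit loop shown here.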

A. DATASETS DESCRIPTION
During our experiments, we have considered three complex aerial imagery datasets including VAID, VEDAI, and DLR3K datasets. The details of these datasets are given as follows:

1) VAID DATASET
The VAID [31] dataset was presented by H.Y. Lin et al. in 2020 for intelligent traffic monitoring via detection and classification of vehicles. The dataset comprises 6000 images of vehicles classified into seven different classes: minibus, cement truck, truck, sedan, pickup truck, bus, and trailer. A drone is used to capture these images in different illumination conditions. The drone flies at an altitude between 90 and 95 meters to obtain consistent images of vehicles. The images are captured at 23.98 frames per second with a resolution of 2720 × 1530. The images are then resized, and the resolution of the pre-processed images is 1137 × 640. The dataset includes traffic and road conditions for ten places in southern Taiwan. A university campus, a city suburb, and an urban environment are all depicted in the images. Fig. 10 shows example images from the VAID dataset.

2) VEDAI DATASET
VEDAI [32] is a dataset for vehicle detection in aerial imagery proposed in 2015. The dataset helps researchers find vehicles in aerial images. The vehicles in the dataset are small and exhibit various characteristics, such as different orientations, lighting, shadows, and occlusion. A standard protocol is also provided to reproduce and compare the results generated by other researchers. For this dataset, the performance of some baseline algorithms is also given. Fig. 11 illustrates some images from the VEDAI dataset.

3) DLR-3K DATASET
The DLR-3K dataset [33] is a collection of various aerial scenes of vehicles from urban as well as some residential areas. The dataset is also known as the DLR Munich vehicle detection dataset and comprises 20 high-resolution images (5616 × 3744) with vehicle types including "car" and "truck". The "car" class appears in more images than the other vehicle types. To train the model, the original images are divided into nine parts (3 × 3), which results in a total of 180 images. A few example images of the DLR3K dataset are shown in Fig. 12.
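The 3 × 3 split can be sketched as follows. This is only an illustration of the arithmetic: the helper name `tile_boxes` and the (left, top, right, bottom) box convention are assumptions, and the actual cropping would be done with an image library.

```python
def tile_boxes(width, height, cols=3, rows=3):
    """Crop boxes (left, top, right, bottom) for an even cols x rows grid."""
    tile_w, tile_h = width // cols, height // rows
    return [(c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            for r in range(rows) for c in range(cols)]


boxes = tile_boxes(5616, 3744)
print(len(boxes), boxes[0])   # 9 tiles, each 1872 x 1248 pixels
print(20 * len(boxes))        # 180 training images from 20 originals
```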

B. IMPLEMENTATION DETAILS
To implement the system, we set up an environment using Python 3.7. The vehicle detection results are based on the CPPM, and the detected vehicles are marked with bounding boxes around them. The detection performance depends on the minimum threshold that is set to detect an object and on the intersection over union (IoU) score. The object and class confidence values are computed as given below:

1) TRAINING CONFIGURATION
A system with a GeForce RTX 3080 Ti GPU is used to train the model. To determine the input layer size and other parameters, a parameter sensitivity analysis is performed that validates the computational performance as well as the accuracy of the model. The sum of squared errors from the final layer of the network is used to compute the training loss. The details of the parameters used during the training process are described in Table 2.

2) MODEL TRAINING
For model training, we split the VEDAI and VAID datasets into train and test sets with a ratio of 80:20, whereas a 70:30 ratio was applied to the DLR-3K dataset. The proposed model is trained on each dataset, and no pretrained weights are used during training. On each of the VEDAI, VAID, and DLR-3K datasets, the proposed model was trained for 20k iterations. The learning rate is changed by a factor of 100 after every 5k iterations. Multiple bounding boxes are generated for each object. The bounding box with the highest IoU score is selected in the proposed model on the basis of the specified threshold.
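The IoU-based selection can be sketched as below. This is a minimal illustration rather than the implementation used in the experiments: boxes are assumed to be (x1, y1, x2, y2) tuples, the function names are hypothetical, and the 0.5 threshold is a placeholder since the value used is not stated here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def best_box(candidates, reference, threshold=0.5):
    """Candidate with the highest IoU against the reference box,
    or None if no candidate reaches the threshold."""
    best = max(candidates, key=lambda box: iou(box, reference), default=None)
    if best is None or iou(best, reference) < threshold:
        return None
    return best


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
print(best_box([(0, 0, 2, 2), (0, 0, 4, 4)], (0, 0, 4, 4)))
```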

C. RESULTS AND ANALYSIS

1) QUANTITATIVE ANALYSIS
In this section, we conducted experiments to record the detection and classification accuracy of the proposed model over benchmark datasets in order to ensure its validity compared to other existing methods.

a: DETECTION ACCURACIES
The proposed model is evaluated over three benchmark datasets: VEDAI, DLR-3K, and VAID. We computed different metrics, including mean average precision (mAP), specificity, recall, precision, and F1 score. The detailed analysis of the metrics is recorded in Tables 3 and 4 for the VAID and VEDAI datasets, respectively. To ensure fairness, the same set of unseen samples from the test data is used to evaluate the proposed model. The results showed better performance than the existing state-of-the-art techniques.
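For reference, the count-based metrics listed above follow directly from the numbers of true/false positives and negatives. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the reported results.

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, and F1 score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical counts, for illustration only.
print(detection_metrics(tp=90, fp=10, fn=5, tn=95))
```

Mean average precision additionally requires ranking detections by confidence and averaging the area under the precision-recall curve over classes, which is omitted here.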

b: CLASSIFICATION ACCURACIES
In this section, experiments are carried out to validate the significance of our proposed system. To present the results, we computed the confusion matrix for vehicle classification over the VAID and VEDAI datasets, as shown in Tables 5 and 6. It is evident from Table 5 that the proposed model achieves an average classification accuracy of 96.71% over the VAID dataset. Moreover, the sedan and trailer classes have the highest accuracy, while the cement truck class has the lowest accuracy. Similarly, Table 6 shows that the car class achieved the highest classification accuracy, while the camping car class has the lowest classification accuracy.
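Per-class and average accuracies of the kind discussed above can be read off a confusion matrix as sketched below. The convention that rows are true classes and columns are predicted classes is an assumption, as are the function names; the toy matrix is invented for illustration and does not reproduce Tables 5 or 6.

```python
def per_class_accuracy(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]


def mean_accuracy(cm):
    """Unweighted average of the per-class accuracies."""
    accuracies = per_class_accuracy(cm)
    return sum(accuracies) / len(accuracies)


# Toy 2-class confusion matrix (rows: true class, columns: predicted class).
toy_cm = [[9, 1],
          [2, 8]]
print(per_class_accuracy(toy_cm))  # [0.9, 0.8]
print(mean_accuracy(toy_cm))
```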

2) QUALITATIVE ANALYSIS
We examined the results of our proposed model qualitatively in a variety of challenging circumstances. With the proposed model, semantic segmentation and vehicle detection on the three benchmark datasets are performed with high mean accuracy, and the classification results are strong. Additionally, the proposed system is able to identify partially occluded vehicles: in Figure 13 (a), some partially occluded vehicles in the shade are detected by the system, while in Figure 13 (b), some vehicles are occluded under trees, yet the system is still able to detect them. Our proposed system is also helpful in the robust detection of vehicles with different shapes and orientations. Furthermore, the system is capable of detecting vehicles that are covered by the shadows of other objects. There are, however, some failure cases where the occlusion covers more than 50% of the object in terms of pixels, as shown in Figure 14. When most of the vehicle is occluded under a tree, the system is unable to detect it as a vehicle. Nevertheless, the overall qualitative results validate the performance of our system in a diversity of challenging environments.

3) COMPLEXITY ANALYSIS
The computation and space complexity of the proposed system is computed in terms of the number of parameters and model size. The complexity comparison with current SOTA methods is illustrated in Table 7. It is evident from the statistics that the proposed model has fewer parameters when compared with existing techniques including YOLO, RetinaNet, and Faster R-CNN. Moreover, the proposed model requires less memory space as compared to YOLO and other deep networks like Faster R-CNN, or RetinaNet. As a result, the proposed method is more efficient (in terms of computation and memory space) than the current SOTA methods.

4) COMPARATIVE ANALYSIS
We evaluated our model and compared it with existing techniques from the literature that are considered to be SOTA. Table 8 demonstrates the comparison of classification accuracies between the proposed model and SOTA techniques on the VEDAI and VAID datasets, while Table 9 illustrates the comparison of detection accuracies over the DLR3K dataset.

V. DISCUSSION
The proposed traffic monitoring system was designed to manage traffic using high-resolution aerial imagery. In this study, we developed a framework that uses CNN-based semantic segmentation to effectively segment out the objects, more specifically the vehicles, in the aerial images. These segmented objects are further examined for vehicle detection through the proposed CPPM. Then the detected vehicles are categorized into different classes. Additionally, all the categorized vehicles are tracked by employing Kalman filter and kernelized filter-based tracking techniques. Both techniques produced good results, yet kernelized filter-based tracking surpasses the former. High-resolution aerial images are challenging to process, particularly when dealing with vehicle detection. Therefore, an effective mechanism of CNN-based semantic segmentation was incorporated to achieve significant results for segmented regions from the complex high-resolution scene images. Once the aerial images are segmented, they are analyzed to detect different vehicles. The detection phase is the most important part of the system, where a novel CPPM technique is devised that enhances the efficiency of the overall system. There is a significant increase in accuracy, precision, and recall in both detection and classification as a result of the CPPM. Moreover, the proposed CPPM technique supports the effective tracking of the vehicles once they are detected.
We applied several techniques, including the hyper region proposal network (HRPN), Faster R-CNN, the aggregated channel features (ACF) detector, and CPPM, to validate our proposed detection mechanism. For this purpose, the same benchmarks, i.e., VAID, VEDAI, and DLR3K, are considered for vehicle detection. The average detection accuracies of these techniques are shown in Fig. 15. It is evident from Figure 13 that CPPM outperforms the other techniques in terms of detection accuracy over the benchmark datasets. ACF has the lowest performance among the compared techniques, below Faster R-CNN, HRPN, and the proposed CPPM. Moreover, the proposed CPPM performed remarkably well on all the considered datasets.
Once the vehicles are detected, they are classified. The classified vehicles are then tracked via two techniques, i.e., Kalman filter-based tracking and kernelized filter-based tracking. The accuracies of the two tracking algorithms are then compared, and the algorithm with the better accuracy is adopted for final vehicle tracking. In most cases, the kernelized filter-based tracking algorithm showed better tracking results than Kalman filter-based tracking; hence, kernelized filter-based tracking is adopted. The results of both Kalman and kernelized filter-based tracking are shown in Figure 16.
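For readers unfamiliar with Kalman filter-based tracking, the sketch below tracks the centroid of one detected vehicle with a textbook constant-velocity Kalman filter, run independently on the x and y image axes. It is a minimal illustration under stated assumptions, not the tracker configuration used in our experiments: the class and function names, the noise settings q and r, the simplified process noise Q = q * I, and the synthetic detections are all hypothetical.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one image axis with state [position, velocity]."""

    def __init__(self, q=0.01, r=1.0, dt=1.0):
        self.p, self.v = 0.0, 0.0              # state estimate
        self.P = [[1e3, 0.0], [0.0, 1e3]]      # large initial uncertainty
        self.q, self.r, self.dt = q, r, dt     # assumed noise levels, frame step

    def predict(self):
        dt, P = self.dt, self.P
        self.p += dt * self.v                  # x = F x, with F = [[1, dt], [0, 1]]
        # P = F P F^T + Q, with the simplified process noise Q = q * I
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.p

    def update(self, z):
        P = self.P
        y = z - self.p                         # innovation, with H = [1, 0]
        s = P[0][0] + self.r                   # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s      # Kalman gain
        self.p += k0 * y
        self.v += k1 * y
        # P = (I - K H) P
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        return self.p


def track_centroids(centroids):
    """Filter a sequence of (x, y) detections; return the track and velocity."""
    kx, ky = ConstantVelocityKalman1D(), ConstantVelocityKalman1D()
    track = []
    for cx, cy in centroids:
        kx.predict()
        ky.predict()
        track.append((kx.update(cx), ky.update(cy)))
    return track, (kx.v, ky.v)


# Hypothetical detections: a vehicle moving 2 px/frame right and 1 px/frame up.
detections = [(10.0 + 2.0 * t, 50.0 - 1.0 * t) for t in range(100)]
track, velocity = track_centroids(detections)
print(track[-1], velocity)
```

In a full pipeline, the predict step would also bridge frames in which the detector misses the vehicle, and the filtered track could be scored against ground truth in the same way as the kernelized tracker before choosing between them.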

VI. CONCLUSION
This paper presents a framework for vehicle detection in aerial images captured by drones. The proposed model has potential applications in intelligent traffic monitoring, traffic management, and smart surveillance systems. The novel traffic monitoring system improves the efficiency of vehicle detection through the proposed customized pyramid pooling module. The initial module segments the images, and the novel customized pyramid pooling module is then applied to detect various vehicles in the aerial images. These vehicles are then classified into different categories via linear discriminant analysis.
Finally, these classified vehicles are tracked via Kalman filter and kernelized filter-based tracking. The effectiveness of our methodology is validated on the VAID, VEDAI, and DLR3K datasets, and the experimental comparison with other SOTA methods further demonstrates the significance of the proposed method.
The datasets considered for the experiments are complex, with dynamic and diverse backgrounds and different types of vehicles. The scene images were collected from various locations, including rural and urban areas. Because of the dynamic scenes, noisy vehicle information, and cluttered backgrounds, our proposed detection module (CPPM) responded differently on different datasets. We faced difficulties when vehicles were partially or fully occluded, for example under trees, in the shadows of buildings, or behind similar objects. In future work, we plan to improve the effectiveness of vehicle tracking using an end-to-end deep learning method for overall traffic monitoring based on vehicle detection and tracking for surveillance.
VOLUME 11, 2023