A Systematic Review of Drone-Based Road Traffic Monitoring Systems

Drone deployment has become crucial in a variety of applications, including solutions to traffic issues in metropolitan areas and on highways. On the other hand, data collected via drones suffer from several problems, including a wide range of object scales, angle variations, truncation, and occlusion. To process and manipulate visual data from drones, a variety of image processing algorithms have been employed, each with a distinct aim. Additionally, recent breakthroughs in the field of Artificial Intelligence, particularly deep learning, have attracted broad interest and are being applied to many domains in the framework of smart cities, including road traffic monitoring. The purpose of this study is to conduct a systematic review of drone-based traffic monitoring systems from a deep learning perspective. This work focuses on vehicle detection, tracking, and counting, since they are fundamental building blocks toward solutions for traffic congestion, flow rate, and vehicle speed estimation. Additionally, drone-based datasets are examined, which face issues and problems caused by the diversity of features inherent to drone devices. The review analysis presented in this work summarizes the literature solutions provided and deployed so far and discusses future research trends in establishing a comprehensive traffic monitoring system in support of the development of smart cities.


II. BACKGROUND AND MOTIVATION
According to a study from the United Nations, the global population is predicted to grow by two billion people by 2050 [13]. Consequently, issues concerning transportation, both related to infrastructure and traffic monitoring, are rising as a result of high population growth, thus driving the rapid development of smart cities. Unlike traditional cameras deployed for traffic surveillance, drone-based vision enables a diverse and smart system to be designed. A general framework for AI-based traffic monitoring using UAVs can be outlined as follows: acquisition of images/videos, data processing using AI/DL techniques, and output to the assigned users or management centers [9]. In this connection, a detailed guide on implementing a UAV-based traffic monitoring system is given in [14]. UAV-based traffic monitoring tasks can be carried out either on-board or via cloud-based video processing. Fig. 2 illustrates a high-level schematic view of the processing and analysis of cloud-based aerial images/videos. As discussed above, implementing drone-based traffic monitoring tasks in urban areas and on highways is quite challenging [28].

Mostly, DL-based frameworks use pre-trained models to improve and refine the feature extraction, such as in [29]. In [30], a review of some of the main state-of-the-art object detection frameworks is provided along with some experimental analysis. In contrast to static images, object detection in videos requires processing of each frame. Additionally, the following aspects must be taken into account: the spatial and temporal correlation of frames, to overcome feature extraction redundancies, and the effect of motion blur, occlusion, and posture changes, which can contribute to the low quality of some individual frames in a video sequence [31].
Indeed, overlooking these factors results in a decrease in the object detection performance in videos. The difference between static and video object detection frameworks lies in the incorporation of temporal information. Furthermore, although video object detection and Multi-Object Tracking (MOT) both use UAV video data, the way temporal information is employed differs in the two cases. The former aims at improving the detection rate of the current frame by exploiting context information from previous frames, while the latter aims at forecasting the trajectory of objects in future frames. The difference between these two approaches is shown in Fig. 3. Furthermore, in UAV-based videos the movements of the camera within a scene, usually referred to as ego motion, complicate the analysis; common countermeasures include image registration [33], [34], [35] and optical flow [33], [34], [36]. In image registration solutions, the moving background is turned into a fixed one, hence facilitating the background subtraction task [17] and overcoming the ego motion issue. Alternatively, optical flow is used in conjunction with image registration, although optical flow alone struggles when detecting vehicles in dense traffic situations. However, optical flow combined with supervised learning enables the detection of vehicles in a dense traffic environment using a UAV video with ego motion.

Keeping in view the need and importance of drone-based traffic monitoring systems using DL, a systematic study has been carried out on three domains characterizing traffic monitoring systems: vehicle detection, tracking, and counting. Various frameworks, methodologies, and DL-based solutions have been adopted and validated so far to design such systems. To conduct the review, the following RQs have been devised in relation to the identified objectives and are listed in Table 3.
For the output of a particular detector, threshold values such as a confidence threshold C_t and an IoU threshold I_t are assigned. If the predicted label l_i matches the GT label, the predicted score s_i is greater than C_t, and the IoU value is greater than I_t, the predicted output (b_i, s_i, l_i) is classified as TP; otherwise it is considered FP. The evaluation metrics of the object detection algorithm, precision and recall, are defined as follows:
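Precision and recall follow the standard definitions in terms of the TP, FP, and FN counts:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}
```

Note that neither metric involves TN predictions, which is why the PR curve discussed below is often preferred for unbalanced datasets.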

III. METHODOLOGY
Precision measures the fraction of true positive predictions among all positive predictions, while recall measures the fraction of true positives among all ground-truth objects. Other metrics, such as quality, completeness, and correctness, are also used to evaluate the performance of object detection frameworks, for instance in [53]. Completeness is similar to recall, while correctness is equivalent to precision. Compared to completeness and correctness, the quality metric remains the most significant because it incorporates both of them, being commonly calculated as TP/(TP + FP + FN). Since the precision and recall metrics do not include TN predictions, the Precision-Recall (PR) curve is also employed as an evaluation metric to examine the performance of the detector, particularly for unbalanced datasets, as given in [45]. For a given confidence threshold C_t in the detector, an optimal PR curve should have high precision and recall values, which means a large Area Under the Curve (AUC). However, in real experiments, the PR curve always exhibits a zig-zag pattern due to noise, which makes a true measurement of the AUC quite difficult. The zig-zag pattern is removed before estimating the AUC by computing the AP using either 11-point interpolation or all-point interpolation. The former consists in interpolating the precision value at 11 recall points, e.g., (0, 0.1, ..., 1). This method was initially proposed in the Pascal-VOC competition [54] and later changed to all-point interpolation, which yields a more accurate AP estimate.

MOTA and MOTP values should be higher for an efficient and accurate tracker.
Furthermore, the VisDrone2018 competition [62] employed alternate evaluation metrics for the two MOT tracking tasks, defining MOT-a and MOT-b based on the prior availability of object detection data in each video frame. For MOT-a (no prior object detection results), AP-based evaluation metrics were employed [63], while for MOT-b (with prior detection results), evaluation metrics from [64] were used. For the performance evaluation of MOT algorithms in VisDrone2019 and later [40], [65], the authors used the metrics from [63], irrespective of the availability of prior detection outcomes. Regarding the evaluation of tracking tasks in traffic monitoring systems, MOTA and MOTP metrics have been primarily employed. In addition to these two, precision and AP have been computed. Differently, some works have reported tracking performance by illustrating motion tracks, such as in [66], while others have only presented qualitative analysis, for instance in [67]. Also, the tracking rate, i.e., the ratio of accurately tracked vehicles to all tracked vehicles, is used in [3].

This section discusses two main aspects of RQ2. The first part addresses and presents data from the explored drone-based traffic monitoring systems in the context of detection, tracking, and counting, while the second part discusses the image pre-processing and augmentation techniques. We choose to present works related to the different aspects separately: vehicle detection first, followed by the vehicle tracking and counting sections. Additionally, on-board traffic monitoring systems are mentioned individually. Based on the conducted research, Fig. 4 depicts the taxonomy of traffic monitoring systems for various properties such as processing platform, subsystems/tasks, environment type, issues addressed, and UAV state.
Noteworthy is the fact that drone-based traffic monitoring systems have implemented detection, counting, tracking, congestion analysis, flow rate, and speed estimation tasks. This study focuses solely on the detection, tracking, and counting tasks, as indicated by the dashed lines. While implementing subsystems/tasks, traffic monitoring systems exhibit diverse characteristics. For instance, some of the works involved on-board computation of traffic monitoring tasks, while others involved remote processing. Similarly, a few studies addressed occlusion problems, whereas others concentrated on real-time processing.

Reported comparisons have involved the You Only Look Once (YOLO)9000 [82] and YOLOv3 [25] frameworks. Following another approach to account for the features of small-size objects, the authors in [70] presented De-convolutional YOLO (DYOLO), in which all layers following conv5_5 and conv6_5 were eliminated from the backbone model, and extra convolutional layers for up-sampling the features to take advantage of context information were added. As a consequence, DYOLO outperformed the Faster Region-based CNN (Faster R-CNN) [83], SSD [81], and YOLOv2 [82] frameworks. In addition, to improve vehicle detection performance in high-resolution multi-scale aerial images, Li and Li [71] proposed the Image Spatial Pyramid Detection Model (ISPDM) with the YOLOv3 framework. For the original and image patch layers of the image spatial pyramid, an integrated decision-making algorithm was formulated to reduce multiple detections of the same objects. Results improved over the base YOLOv3 [25], Faster R-CNN [83], and SSD [81] frameworks. Moreover, Benjdira et al. [72] used Faster R-CNN [83] and YOLOv3 [25] to detect cars for traffic monitoring purposes. To analyze the traffic from UAV images, Adaimi et al.
[73] designed a Butterfly detector to handle the difficulties of a wide range of object scales, viewing angle fluctuation, and occlusion. This is achieved by introducing the butterfly field concept, which describes the spatial information of output characteristics and object scale, while occlusion and viewing angle difficulties are addressed by using a voting system between butterfly vectors pointing to the object's center. The Butterfly detector is an anchor-free detector that overcomes the disadvantages of both anchor-based and anchor-free detectors by introducing characteristics such as locating the center, width, and height information of the object of interest and generating the butterfly field from object-specific features with specific aspect ratios. Li et al. [12] carried out vehicle detection using YOLOv3 [25].

Since the training process works by extracting features from the provided input images, pre-processing and augmentation approaches help the model to learn by presenting the same scene from diverse perspectives. Different state-of-the-art techniques used to train the DL model for the vehicle detection task are discussed in the following. Concerning image pre-processing, in [70] all input images were tiled into 512 × 512 pixels and tiles containing no object were not considered in the training part. Similarly, in [71] images were also divided into patches, where only areas containing objects were selected using the SURF algorithm [19] for training purposes. Furthermore, Wang et al. [74] divided high-resolution images of 9000 × 6700 pixels, taken by a drone with fixed height and angle, into 100 pieces cropped to 900 × 670 pixels, also discarding the duplicates before training. Moreover, to improve multi-scale vehicle detection accuracy, Li et al.
[69] reduced the image sizes without distorting the vehicle information by cropping and dividing the raw input image into two patches. These two patches and the original image were combined into a single batch to feed the CNN for feature extraction. In addition, in [11] the input images were segmented into 4 × 4 blocks, and the segments were added to the original image, thus increasing the training set by 5 times. This practice of data augmentation helped in the detection of small-size objects in aerial images.
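The tiling-style augmentation described above can be illustrated with a minimal sketch; the array sizes and the 2 × 2 split are illustrative assumptions (four tiles plus the original give a 5× larger training set), not the exact pipeline of [11] or [69]:

```python
import numpy as np

def tile_image(img, rows, cols):
    """Split an image into rows x cols non-overlapping tiles."""
    h, w = img.shape[:2]
    th, tw = h // rows, w // cols
    return [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

# Toy "aerial image" (height x width x channels).
img = np.zeros((400, 600, 3), dtype=np.uint8)

# Tiles plus the original image form the enlarged training set,
# so small vehicles appear at a larger relative scale in each tile.
tiles = tile_image(img, 2, 2)
training_set = tiles + [img]
print(len(training_set))  # → 5
```

In practice, the ground-truth BBoxes must be clipped and re-mapped to each tile's coordinate frame, and tiles containing no objects may be discarded, as done in [70].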

Due to the instability of recorded video, image alignment techniques (feature descriptors, homography between images) were employed in [92]. For feature descriptors and homography, the Oriented FAST and Rotated BRIEF (ORB) and RANdom SAmple Consensus (RANSAC) algorithms were used, respectively. Due to memory and processing speed constraints, the original frame dimensions were reduced from 4096 × 2160 to 3797 × 400 pixels by cropping, before feeding them to the detection framework. Furthermore, concerning the data preparation task in [95], the circulant structure of tracking-by-detection with kernels [117] was used to annotate the training images of a video scene, while a rotation-based data augmentation technique was used to increase the number of training samples. Another interesting work in this framework is [96], where augmentation techniques (random scaling and translation) and dropout (0.5 rate) were implemented in the training phase in order to avoid over-fitting. Furthermore, horizontal and clockwise rotations were performed as augmentation techniques before training in [98].

In addition to these studies, counting of objects in crowds is implemented using the DL-based Congested Scene Recognition (CSR) concept, also referred to as density map estimation, in [120], which is not only applied to people but has been proven effective for vehicles too. The idea in CSRNet [120] is to use a deeper, end-to-end trainable CNN to extract high-level features and produce heat maps for counting. Further, according to [121], density maps represent ambiguous features of objects in congested situations, resulting in an increase in error.
As a result, object counting is decoupled into probability and count map regressions, with the former illustrating the probability of each pixel being an object and the latter counting the objects based on the probability map regression. Furthermore, the frequency feature pyramid module is implemented in [122] to address the issues of scale variations, imbalanced data distribution, and insufficient local features in counting crowds and vehicles. This module addresses the multi-scale variation and the imbalanced data distribution using frequency branches and a global-local consistency loss function, respectively.
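Density-map counting, as in CSRNet-style approaches, rests on the property that the integral of the map equals the object count. A minimal sketch placing one normalized Gaussian per annotated vehicle center (the kernel width and the toy coordinates are illustrative choices, not taken from [120]):

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Build a density map whose sum equals the number of points.

    Each annotated object center contributes a normalized Gaussian,
    so integrating (summing) the map recovers the count.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=np.float64)
    for (y, x) in points:
        g = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalize so each object adds exactly 1
    return dmap

# Three annotated vehicle centers in a 64x64 crop (toy data).
centers = [(10, 12), (30, 40), (50, 20)]
dmap = density_map(centers, (64, 64))
print(round(float(dmap.sum())))  # → 3
```

A counting network is then trained to regress such maps from the raw image, and the predicted count is simply the sum over the predicted map.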

Object tracking is especially useful when object detection is challenging due to various factors such as motion blur, occlusion, and scale and angle variations. In a broader view, tracking algorithms can be classified as online/offline, SOT/MOT, and detection-based/detection-free. In SOT approaches, objects are tracked through their entire motion, independently of their detection, whereas in MOT methods, objects are tracked only if they are first detected and localized. For this reason, MOT is considered a detection-based tracking approach. Furthermore, MOT solutions employ both offline and online approaches, where offline methods achieve better performance, while online solutions prove more robust.

The Kalman Filter (KF) is employed in [24] to track the vehicles in a UAV sequence, where the vehicle's current state is predicted based upon the motion model of the last state, and then the vehicle's position is updated by leveraging its centroid information. On the same line, [3] and [12] have also employed the KF technique. However, in [12] the main focus was on multiple-vehicle speed estimation, taking into account tracking and motion estimation. From a different perspective, in [93] vehicle tracking is achieved by performing vehicle Re-IDentification (Re-ID) with deep features and motion estimation with the KF technique. As stated in [107], the DCF tracker alone is not only unable to handle the occlusion problem, but it also cannot provide robust tracking. Therefore, a combined model of KF and DCF is used in [107], which is more robust and reliable than the individual DCF tracker. However, according to [37], KF and particle filtering techniques are not suited for vehicle tracking from UAV videos with background motion. To address this problem, Ke et al.
[37] employed the KLT approach [119], an interest point method, to estimate background and vehicle motions. Concerning object data association, which is the problem of matching the predicted BBoxes of existing vehicles with the detected BBoxes of vehicles in the current frame so as to optimize the number of matches in the two sets of BBoxes, a Hungarian algorithm [129] is implemented in [3], [12], and [107].
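The data-association step can be illustrated with IoU costs and an exhaustive optimal assignment; for the tiny instances shown here, brute force yields the same matching the Hungarian algorithm computes in polynomial time (the boxes are made-up values, not data from the cited works):

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(predicted, detected):
    """Match predicted track boxes to detections, maximizing total IoU.

    Brute-force stand-in for the Hungarian algorithm (fine for tiny n,
    assuming equally many tracks and detections). Returns a tuple p
    such that track i is matched to detection p[i].
    """
    return max(permutations(range(len(detected))),
               key=lambda p: sum(iou(predicted[i], detected[p[i]])
                                 for i in range(len(predicted))))

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 19, 31, 29), (1, 1, 11, 11)]  # shuffled order
print(associate(tracks, detections))  # → (1, 0)
```

Real trackers replace the brute-force search with the O(n^3) Hungarian algorithm and typically gate the cost matrix, leaving unmatched tracks and detections to be terminated or initialized.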

The MOSSE tracker, an accurate and fast algorithm in which estimated correlation filters are used to approximate the detected objects' location in a video frame, is employed in [5] to track the detected vehicles [127]. During tracking, the area of the detected object in the first frame is transformed into the frequency domain using the Discrete Fourier Transform (DFT) to generate synthetic data for the initialization of the tracker and the update of the filter. Furthermore, in order to track vehicles with a moving UAV, [91] used a DCF filter with Channel and Spatial Reliability Tracking (CSRT), where CSRT was employed due to its ability to achieve higher accuracy with respect to MOSSE. Two standard features, HoG and color names, have been used in CSRT, and the decision to start tracking depends upon a threshold on SSIM.

More challenging conditions, such as a larger number of classes, higher altitudes, and different lighting conditions, would likely degrade the overall system performance. In addition, in the majority of vehicle detection systems, DL models are trained using weights of pre-trained models based on large-scale datasets such as the MS-COCO dataset [77], which contains regular and high-resolution images in their natural context. On the contrary, drone images have diverse and irregular viewpoints, and small and dense scenarios. Therefore, the use of weights trained on drone-based large-scale imagery would produce a performance boost [94]. A further challenge is related to target size. As altitude increases, vehicle size gets smaller. Also, with continuous variations in altitude, a corresponding impact is observed on object sizes.

In addition, simultaneous variations in altitude and movement of vehicles make multi-scale detection a quite challenging task. Although researchers have tried to tackle this crucial and critical problem, in most cases they have not explicitly mentioned the considered altitude information. In this connection, Ham et al. [97] suggested that the SSD framework is promising for the vehicle detection task in aerial imagery and that its inherent multi-scale feature structure facilitates multi-size object detection, making this solution suitable for traffic monitoring at generically high altitudes. However, the exact value or range of UAV altitude considered by the authors was not mentioned for the obtained results. In addition, results produced by Zhu et al. [3] indicate that Enhanced-SSD works better than the SSD, Faster R-CNN, and YOLO frameworks for traffic monitoring tasks in specific urban road traffic scenarios recorded in UAV videos. Furthermore, according to [98], multi-level feature fusion is effective in increasing detection performance, especially when small objects are considered.

In most cases, it has been observed that the output of vehicle detection is obtained in terms of a BBox for each detected vehicle. However, more precise information about the detection can be obtained by further segmenting each identified vehicle. The framework proposed in [76] achieved better results in segmenting the targeted vehicles in thermal images when compared to existing segmentation-based frameworks such as the Mask R-CNN detector [101]. However, this framework considers only a single class, while performance may vary when dealing with a multi-class scenario. Furthermore, based on various experimental analyses for object detection in UAV videos in [110], it was deduced that the selection of the feature extraction model and detection framework must take into account the processing speed and accuracy, as well as the ability to detect objects of varying sizes from aerial views. Additionally, system configuration and memory size play a significant role in processing speed. As demonstrated in [91], while estimating the speed of vehicles in video captured with both static and moving UAVs, the frame rate decreases as the number of vehicles increases, and vice versa.
It is noticed that the tracking outcome is depicted either in terms of motion tracks or through qualitative analysis.

State-of-the-art detectors such as YOLOv5 [131], YOLOX [132], and YOLOv6 [133] are worth investigating towards increasing the performance of traffic monitoring systems in the case of both small- and large-scale objects. Also, the mentioned variants of the YOLO family are more suitable for processing a large number of frames per second with respect to other existing solutions. Therefore, investigating drone-based traffic monitoring in this sense could be promising in terms of accuracy and real-time performance. Other interesting aspects requiring further research are pre-processing techniques such as image alignment, which are not yet able to handle the issues of UAV elevation and angle variations precisely. In this connection, the joint use of georeferenced UAV videos may help in tackling alignment issues.

The majority of methodologies covered in this review process UAV videos using static object detection frameworks rather than standard video object detection, such as [134], [135], [136]. However, a few of them have used image alignment techniques prior to detection, such as FBIA, in conjunction with static object detection frameworks, and the temporal information of the UAV videos has contributed to the tracking task. Therefore, traffic monitoring systems can be implemented using typical video object detection techniques that make use of temporal and contextual information to address missed detections in consecutive frames. Thus, a comparative analysis of traffic monitoring systems using static and video object detection frameworks may be conducted with respect to computational burden, accuracy, and frame rate. A review of state-of-the-art video object detection solutions is presented in [137], and details about spatio-temporal models and feature extraction strategies from a DL perspective are given in [138].

Furthermore, implementing vehicle monitoring frameworks that work in diverse weather and light conditions is also a future direction to pursue. Additionally, the detection of vehicles in aerial scenes during nighttime poses additional difficulties and challenges in terms of scarce lighting conditions, dark environments, and high motion blur. Towards solving these problems, detection approaches based on event camera images could be more promising than frame-based detection approaches. Indeed, event-based solutions are invariant to absolute illumination levels and more robust to high latency and motion blur issues, as reported in [139]. Event cameras, which provide high temporal precision, high dynamic range, and low data rates, represent a paradigm shift from conventional frame-based cameras.
Thanks to these properties, event cameras are especially well-suited for situations involving a lot of motion and difficult lighting conditions [139], [140]. Therefore, the design of specific frameworks for vehicle detection with event cameras during nighttime is a promising research direction.

From a traffic monitoring perspective, the goal of MOT is to track all the vehicles in an aerial sequence. In most cases, vehicles are of the same type, especially cars and vans. Also, if the drone is at a high altitude or the traffic is dense, problems arise such as occlusions, a moving camera, and different levels of occlusion depending upon the camera viewpoint. Tackling all these issues with a single algorithm is a quite challenging task. Moreover, tracking algorithms based upon the concept of sparse representation of targets could be useful in the presence of different influential parameters such as motion blur, occlusion, low resolution, illumination variation, scale variation, and background cluttering [4]. An ad-hoc investigation on the use of such algorithms for multi-vehicle tracking could be an interesting research topic in the framework of UAV-based traffic monitoring.

The literature includes a large number of vehicle tracking systems implemented using the tracking-by-detection approach, where data association and motion models have been used in consecutive frames for the tracking task, given a target detection output. Any error in the detection output reflects on the performance of the tracking task. Concerning this issue, [144] proposes tracking-by-regression as an alternative, which can also be used for multiple-vehicle tracking to improve the overall system performance. In this approach, the regression head of the object detector is used to compute the tracking task. The main issues related to the tracking problem are occluded vehicles, dense traffic conditions, sudden drone camera motions, large displacements due to low frame rates, small vehicle sizes due to high altitudes, and recovering the same vehicles from occlusions in dense and high-altitude scenes. In this connection, vehicle Re-ID and motion models in conjunction with tracking-by-regression might help to mitigate the aforementioned problems. In detail, problems such as Re-ID of partially occluded vehicles and tracking multiple vehicles in a drone-based video sequence with high frame rates could strongly degrade the tracking performance. Additionally, since vehicles move quickly and sometimes appear to be identical (e.g., in color and shape), the ID switch problem is a significant concern. In a nutshell, tackling all these issues for the MOT task in traffic monitoring systems is an attractive and current research trend.

On one hand, a thorough analysis of the state of the art showed that the different traffic monitoring tasks, namely detection, counting, and tracking, have usually been carried out separately. On the other hand, the output of one task is commonly used as input to facilitate the subsequent one.
For instance, the detection output is used as input to the tracking task, and the outcome of tracking helps to perform the counting task. However, a DL-based framework could be designed to carry out all these tasks jointly, by defining a new loss function for the network training. The architecture of such a framework would merge the tracking and counting blocks with the detection block in order to devise a complete DL-based vehicle monitoring framework.
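One hypothetical form of such a joint objective, with weighting coefficients to be tuned during training, is sketched below; this decomposition is illustrative and not taken from any cited work:

```latex
\mathcal{L}_{\text{joint}} = \mathcal{L}_{\text{det}} + \lambda_{t}\,\mathcal{L}_{\text{track}} + \lambda_{c}\,\mathcal{L}_{\text{count}}
```

Here L_det is a standard detection loss (classification plus BBox regression), L_track penalizes association or regression errors across consecutive frames, L_count penalizes the deviation of the predicted count (e.g., the integral of a density map) from the ground truth, and lambda_t, lambda_c balance the three tasks.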