A Comprehensive Review for Typical Applications Based Upon Unmanned Aerial Vehicle Platform

Unmanned aerial vehicles (UAVs) have been widely applied in military and civilian fields due to their flexibility and effectiveness. As a vital component of UAVs, the vision system plays a significant role in applications such as autonomous landing, traffic surveillance, and disaster rescue, and has attracted widespread attention in recent years. Therefore, automatic understanding of the visual data collected from these aerial platforms is urgently needed in UAV systems. In this review, we revisit and summarize recent techniques and developments for several typical UAV applications, including object detection, object tracking, and semantic segmentation. In addition, we highlight the difficulties and future directions from different perspectives, which may stimulate future research and applications in the UAV vision era.

imaging mechanism and characteristic also pose new challenges for remote sensing vision tasks.
Inevitable Image Degradation: Because both the UAV platform and the target of interest move rapidly, the external environment (e.g., weather, illumination conditions, and scenes) changes rapidly. Moreover, under strong wind, the platform usually undergoes mechanical vibration, which may result in motion blur and image degradation. Such challenging attributes introduce large variations in object appearance and degrade the quality of the captured data. In addition, harsh scenes with poor visibility, such as rainy or foggy days and nighttime, make it harder for algorithms to separate objects from the background. Therefore, to improve the quality of the captured data, an image preprocessing module is needed to reduce noise and correct camera distortion.
Uneven Target Size and Distribution: Generally, the UAV obtains data from different altitudes using a large-aperture, fixed-focal-length, wide-angle lens, which results in uneven target sizes. Specifically, some objects may be densely located and even overlap with each other, while others may be very sparse; some objects may occupy a large proportion of the image, while others are very small with few distinctive features. In [12], Han pointed out that such uneven statistical properties also increase the difficulty of detecting targets against their surrounding background.
Viewpoint Variation and Occlusion: Because UAV platforms have many degrees of freedom and high mobility, UAVs might capture targets from different aspects by flying 360° around them. For example, UAVs can capture the back or front side of the targets, in which case the target appearance may vary severely during imaging [13]. This becomes a major challenge if the methods lack timely online learning and model updates. In addition, partial or even full occlusion is common due to the high mobility of the UAV platform, as illustrated in [14]. Such occlusion temporarily corrupts the target template and may lead to detection failure and tracking drift due to model degradation.
Limited Computation Resources: For most UAV platforms, only a single CPU can be embedded as the processing resource due to strict weight and power limits, which greatly restricts the on-board computing speed. Therefore, the intelligent algorithms should be carefully designed for high efficiency in order to meet the real-time requirement of on-board processing. In addition, considering energy-consuming operations such as maneuvering flight, the on-board algorithms also need to be lightweight enough to conserve the power supply.
These issues present significant challenges in analyzing the image and video data captured from the UAV platform. To address these challenges, many works have emerged to extract useful information from the data and perform different UAV tasks, thus making the UAV more intelligent.
Notably, researchers have concentrated on various UAV-based vision tasks with cutting-edge deep learning (DL) technologies. Some studies summarize the current research on one specific UAV task. However, most reviews target objects captured by conventional cameras [15], [16], [17], [18], [19], [20], while few reviews focus on UAV-view objects [21], [22], [23]. To the best of our knowledge, there is a lack of surveys on UAV emergency landing. Therefore, based on the practical background of Earth observation, we provide a unified overview of object detection, tracking, and semantic segmentation technologies for images and videos captured from UAVs. The main UAV-based Earth observation scene is shown in Fig. 1. First, UAVs acquire data using the corresponding sensors and preprocess the data; second, UAVs transmit the data back to the ground station, which performs task analysis, including but not limited to scene classification, target detection, target segmentation, and scene segmentation. Our work highlights the following aspects.
1) Unlike other works that review only a single task or only object detection/tracking tasks, our work covers the typical applications, i.e., object detection, object tracking, and semantic segmentation for UAV Earth observation. Note that object tracking here refers to single-object tracking.
2) This article focuses on analyzing various representative and recent algorithms thoroughly. We found that the existing methods are mostly evaluated on a specific dataset, and a comprehensive benchmark is lacking.
3) We summarize the challenges in the UAV imaging process and provide future directions, which could benefit the audience in the UAV vision area.
The overall structure of this study is organized as follows. In Section II, we present a brief description of UAV-based datasets for the abovementioned applications. Section III provides a detailed description of the relevant works and algorithms for these applications. In Section IV, we discuss the potential directions to stimulate the development of this field. Finally, Section V concludes this article.

II. DATASET
Numerous aerial image and video datasets exist for object detection, single-object tracking, and semantic segmentation (e.g., DOTA [49], NWPU VHR-10 [50], and VEDAI [51]). In this study, we focus on reviewing the datasets captured on UAV platforms. Their featured attributes, including sequence length, total representative frames, target categories, and corresponding websites, are shown in Tables I and II.

A. Object Detection
Okutama-Action: Okutama-Action [24] is a human action detection dataset captured in 2017 by cameras at 45° or 90° angles mounted on two flexible UAV platforms. It consists of 43 fully annotated sequences containing 12 actions, including carrying, handshaking, drinking, and reading, with 77 365 frames in total. The recording UAV flies at a height of 10-45 m, with a 30-fps imaging speed and 3840 × 2160 image resolution. The Okutama-Action dataset gathers several typical challenging factors in the action detection field, such as abrupt camera movement, remarkable aspect ratio and scale variation, and dynamic action transition.
TABLE I: ILLUSTRATION AND FEATURED ATTRIBUTES FOR THE EXISTING UAV-CAPTURED DATASETS IN TYPICAL EARTH-OBSERVATION APPLICATIONS
VisDrone: The VisDrone dataset [25] is collected by the AISKYEYE team of the Machine Learning and Data Mining Laboratory of Tianjin University. The images are collected by different types of drones in 14 cities (both rural and urban areas) across China under a variety of lighting and weather conditions (e.g., daytime, night, rainy, and foggy). Construction of this dataset started in 2018, and several object detection and tracking challenges have since been hosted at top-ranking computer vision conferences. In addition, the capacity and difficulty of the dataset have continued to increase year over year. More specifically, in 2018, the organizing committee provided 8599 representative frames with ten classes of targets (i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle), while in 2022, the dataset supplies 400 videos comprising 265 228 frames, as well as 10 209 static pictures, to fully validate the performance of participating algorithms. It should also be noted that most targets in VisDrone are densely distributed or overlapping among the 2.6 million labeled bounding boxes. Besides, some targets are extremely small, which poses great challenges for detectors in generating suitable anchors. Some crucial factors, such as out-of-view and occlusion, are also annotated in the ground truth for more accurate validation.
MOR-UAV: The MOR-UAV [26] dataset was collected and made public by a research team at Malaviya National Institute of Technology Jaipur in 2020. It contains 30 UAV-captured video sequences with 10 948 frames at different locations (e.g., highways, agricultural regions, urban areas, and traffic intersections) with a variety of challenging attributes (flexible viewpoint, altitude, abrupt drone motion, changing lighting conditions, different weather, occlusion, and temporal out-of-view). The authors categorized the captured 89 783 instances into two classes, i.e., cars and heavy vehicles. The sequences are recorded at 30 fps with image resolutions varying from 1280 × 720 to 1920 × 1080. In addition, the moving instances are automatically labeled using the YOLO-mark tool and can be employed for validating target detection and recognition algorithms.
Stanford Drone: Stanford Drone [52] is a large-scale object detection and tracking dataset collected by Stanford University in 2016. It mainly covers outdoor scenarios on the Stanford University campus, captured with a 4K camera mounted on a quadcopter at a height of approximately 80 m above the ground. The collected videos are then processed into a series of image sequences with a resolution of 1400 × 1904. In total, 929 499 frames with six categories are carefully labeled in Stanford Drone. Specifically, this dataset covers over 19 000 targets, including 11 000 buses, 22 000 golf carts, 33 000 skateboarders, 13 000 cars, 64 000 bicyclists, and 112 000 walking pedestrians.
UAVDT: UAVDT [34] was collected by the Chinese Academy of Sciences in 2018 and aims to provide a unified large-scale benchmark for multiple tasks, such as vehicle tracking and detection. In UAVDT, 100 sequences are selected from nearly 10 h of raw videos and processed into about 80 000 annotated representative frames covering various common scenes, including toll stations, highways, arterial roads, intersections, and squares. About 2700 vehicles are broadly categorized into three classes (i.e., car, truck, and bus), with 840 000 annotated bounding boxes. Furthermore, the image resolution in UAVDT is 1080 × 540 pixels and the imaging speed is 30 fps. Compared with other existing datasets, UAVDT contains up to 14 representative challenging factors in detection and tracking (e.g., occlusion, vehicle category, camera view, flying altitude, and weather condition).
CARPK: The Car Parking Lot dataset (CARPK) [27] was proposed by National Taiwan University in 2017 and collects 1448 images with approximately 90 000 cars across four different parking lots. Apart from vehicle detection, CARPK is also the first large-scale UAV-captured dataset for validating counting algorithms, where each vehicle target is manually annotated to facilitate evaluation. The image sequences are collected by a Phantom 3 drone flying at an altitude of about 40 m. In CARPK, the largest target is much bigger than 64 × 64, and the maximum number of targets in a single view is about 200, which reflects the multiscale and densely distributed characteristics of targets in parking-lot counting applications.
AU-AIR: AU-AIR [28] is a multimodal UAV object detection dataset, organized by the Department of Engineering, Aarhus University, in 2020. Unlike other UAV-collected datasets, AU-AIR provides not only visual data but also other modalities (i.e., the current altitude, velocity, GPS, time, and IMU). It has 32 823 labeled frames at a size of 1920 × 1080 pixels. AU-AIR collects 132 034 object instances in eight categories (namely, pedestrian, trailer, bike, motorbike, car, bus, truck, and van) related to traffic surveillance under various weather and lighting conditions.
UVSD: UVSD [29] was collected by Shandong University with a DJI Matrice 200 platform at a variety of locations and altitudes in 2020. This dataset is made up of 5874 images with resolutions varying from 960 × 540 to 5280 × 2970 pixels. In addition, the vehicle instances in UVSD are densely distributed, with more than 150 vehicles per image. Therefore, UVSD could also be employed to validate other tasks, such as vehicle counting. Furthermore, UVSD contains up to 98 600 high-quality annotations of different types, including horizontal bounding boxes, oriented bounding boxes, and instance-level semantic annotations.
DroneVehicle: To overcome low-light conditions in UAV visual tasks, the DroneVehicle dataset [30] collects 15 532 RGB-thermal image pairs and 441 642 instances at a resolution of 840 × 712 pixels. The dataset mainly focuses on urban areas covering roads, parking lots, residential areas, highways, and so on. To validate the performance of vehicle detection and counting approaches, the UAV platform collects image sequences from day to night, with large illumination variation. The authors divide the vehicles into five categories, but the number of instances per class is somewhat unbalanced according to the statistics in the original article.
BIRDSAI: BIRDSAI [31] (pronounced similar to bird's eye) is a large-scale infrared UAV object detection and tracking dataset, organized by Harvard University in 2020. Similar to UAV123, BIRDSAI contains both real aerial videos and synthetic sequences. Specifically, 48 real infrared sequences are collected with changing wavelength on a fixed-wing UAV in multiple protected areas in southern Africa, while 124 synthetic sequences are generated from Air Shepherd. The image frames in BIRDSAI contain many challenging factors that may affect stable tracking and accurate detection, such as image rotation, target deformation, large scale change, and aspect ratio variation. The resolution of each frame is fixed at 640 × 480 pixels.
MOHR: The MOHR dataset [32] is a TIR object detection benchmark, collected by Harbin Institute of Technology in 2020 to extend object detection research to large scale variation, arbitrary orientations, and irregular target deformation. In MOHR, 90 014 object instances are broadly classified into five categories, namely, building, flood damage, truck, car, and collapse. Furthermore, to quantitatively validate the performance of the tested detectors, the authors manually annotate this dataset and count the instances of each class as follows: 41 468 buildings, 25 575 cars, 12 957 trucks, 7718 flood damages, and 2296 collapses, with a large range of scale changes. It should be noted that collapses and flood damages are included as target categories in a UAV dataset for the first time. Furthermore, MOHR is collected with three types of cameras (Nikon D800, Sony RX1rM2, and DJI Phantom 4 Pro) at varying flying heights. Accordingly, 3048 aerial images have a size of 5482 × 3078, 5192 images have a size of 7360 × 4912, and the remaining 2390 images have a resolution of 8688 × 5792.
VSAI: The VSAI dataset [33] is an object detection dataset collected by the National University of Defense Technology in 2022. In VSAI, 444 images are collected under different camera angles, flight heights, times, weather conditions, and illuminations. VSAI contains object bounding boxes of two shapes, i.e., oriented bounding boxes (49 712) and arbitrary quadrilateral bounding boxes (47 519 small vehicles and 2193 large vehicles). The image resolutions include 4000 × 3000, 5472 × 3648, and 4056 × 3040. To further improve the generalization ability of object detection methods, VSAI also annotates the occlusion rate of objects.

B. Object Tracking
DTB70: Interestingly, DTB70 is made up of two parts. Some of the sequences capture outdoor scenarios on the university campus with a 4K camera mounted on a DJI Phantom 2 drone flying at a height of approximately 120 m, while the others are collected from YouTube to increase the diversity of the data distribution. Each frame is carefully annotated with a horizontal bounding box, as in some other UAV datasets, and the resolution of each video frame is 1280 × 720.
VisDrone: In addition to object detection, the VisDrone dataset also provides challenge sequences for object tracking. The VisDrone2018 single-object tracking dataset contains 132 sequences with about 106 354 frames. Building on these data, VisDrone2019 provides 167 challenging sequences with 188 998 frames in total. Furthermore, VisDrone2020 provides 192 challenging sequences with 221 920 frames in total.
UAV123: UAV123 [38] was collected and proposed by KAUST (King Abdullah University of Science and Technology) in 2016 and involves 123 sequences and over 110 000 representative images from an aerial viewpoint. UAV123 is made up of three parts, captured with a professional DJI UAV, a small low-cost UAV, and a self-designed UAV simulator, respectively. Therefore, the frame resolution varies with the capture platform. To fill the gap in aerial object tracking, each frame is carefully annotated by the authors with a horizontal bounding box and its corresponding attributes (e.g., occlusion, camera motion, illumination variation, and aspect ratio change). In addition, the flying circumstances of these UAV platforms (e.g., weather condition, flying altitude, and scenario) vary widely in order to enhance the variety and difficulty of this dataset.
UAV20L: The authors select 20 long-term video sequences from UAV123 to form the UAV20L dataset. As a subset of UAV123, UAV20L has 58 670 frames in total. According to the experimental results reported in the original article, most of the tested trackers perform worse on UAV20L than on UAV123. This phenomenon can be attributed to the absence of a redetection mechanism in the tested trackers, which also points out a direction for the object tracking field.
Anti-UAV: Anti-UAV [39] was collected by a research team at the University of Chinese Academy of Sciences using two types of drones (DJI and Parrot) in 2021. The initial purpose of publishing Anti-UAV is to pioneer the research field of tracking UAVs. The Anti-UAV dataset comprises 318 RGB-T video pairs containing 585.9 k annotations, where each pair contains a thermal video and an RGB video. The videos cover a variety of backgrounds (e.g., tree, cloud, building), two light modes (visible and infrared), and two lighting conditions (night and day) at 25 FPS.
UAVDark135: UAVDark135 [41] is the first dark tracking benchmark based on the UAV platform, built to fill the gap in tracking performance evaluation in dark environments. UAVDark135 comprises 135 sequences filmed with a standard UAV at night, with 125 466 manually annotated frames. The total, mean, maximum, and minimum numbers of frames in the benchmark are 125 466, 929, 4571, and 216, respectively. Meanwhile, UAVDark135 contains various scenes (e.g., lakeside, highway, street, ocean, and road) and covers a considerable range of objects (e.g., bikes, trucks, athletes, buildings, cars, and pedestrians), making it suitable for large-scale evaluation.
HighD: Researchers at Aachen University collected 16.5 h of measurements to form the HighD [36] dataset, which contains 110 000 vehicles with 5600 recorded lane changes and 45 000 km of driving distance in total. The videos are collected using a consumer quadcopter at a recording rate of 25 fps with the image resolution set to 4096 × 2160. In addition, HighD involves six different recording locations with different traffic circumstances during sunny and windless weather from 8 A.M. to 5 P.M. Unlike other datasets, HighD was initially organized for safety assessment, but it could also be employed in the simulation and validation of vehicle counting, traffic analysis, and object tracking.
DarkTrack2021: DarkTrack2021 [42] covers 110 challenging sequences with 100 K frames, which are captured at 30 FPS at nighttime in urban scenes. Similar to UAVDark135 [41], the original purpose of constructing this dataset is to provide a comprehensive assessment of tracking performance under poor lighting conditions. The shortest, longest, and average sequence lengths are 92, 6579, and 913 frames, respectively. DarkTrack2021 provides abundant real-world nighttime scenarios with various challenges, including full occlusion, low resolution, motion blur, and viewpoint variation.
UAVTrack112: UAVTrack112 [43], [44] is created from images captured and annotated during real-world tests and contains 112 sequences with 100 313 representative frames. This dataset is established for aerial tracking; therefore, some cityscape scenes are also included. Like DarkTrack2021 [42], this dataset is organized and maintained by Tongji University, China.

C. Semantic Segmentation
AVSD: AVSD [45] was made public by Beihang University in 2020 and involves ten different sequences with 525 pictures in total, 131 of which are annotated manually. The sequences are captured at 12 fps with the resolution fixed at 1280 × 1024. In addition, there are six classes of targets in AVSD, namely, bare land, grassland, forest, building, road, and vehicles. The most challenging factors in AVSD are the varying motion and scene complexity of the collected video sequences.
UAVid: The UAVid dataset [46] was jointly constructed by the University of Twente and Wuhan University in 2018 and includes 30 video sequences with the image resolution fixed at 4K. In this dataset, 300 pictures are densely labeled with eight classes (i.e., background clutter, moving cars, humans, low vegetation, trees, static cars, roads, and buildings) for the urban scene understanding task. Note that the authors also propose an in-house video labeling tool to automatically annotate the sequences in UAVid.
AeroScapes: The AeroScapes aerial semantic segmentation benchmark [47] was designed and organized by Carnegie Mellon University in 2018. The imaging altitude of the commercial drone varies from 5 m to 50 m when constructing this dataset. According to the original article, AeroScapes is made up of 3269 pictures covering 11 object classes with large scale variation, viewpoint change, and varied scene composition.
ManipalUAVid: ManipalUAVid [48] was constructed and made publicly available by Manipal Institute of Technology in 2019 and comprises 667 frames with four classes: road, construction, greenery, and water bodies. The frames are captured at six locations, such as the library, canteen, and hostel, with an image resolution of 1280 × 720. The release of ManipalUAVid helps fill the gap in semantic segmentation using UAV platforms.

III. METHODS
Taking the emergency landing of UAVs as the application background, this section presents a brief overview of methods for object detection, object tracking, and semantic segmentation of the UAV images and videos.

A. Object Detection
Recent advancements in deep learning technologies create new opportunities to study object detection in previously inaccessible ways. Existing object detection methods can usually be divided into two types: 1) two-stage detectors, where one model extracts object region proposals and another classifies the proposals and refines the object localization, including fast R-CNN [53], faster R-CNN [54], cascade R-CNN [55], etc.; and 2) one-stage detectors, which skip the region proposal stage and perform detection over a dense sampling of locations, including YOLOv1 [56], YOLOv2 [57], SSD [58], RetinaNet [59], FCOS [60], etc. In general, the two-stage detectors achieve higher object localization and recognition accuracy, while the one-stage detectors are characterized by higher inference speed. Next, we introduce the detection methods for the UAV environment in detail. Mittal et al. [21] reviewed low-altitude UAV object detection based on deep learning. They pointed out that low-altitude UAV-based object detection faces more challenges than detection in standard images, such as large scale changes, densely distributed objects, arbitrary orientations, relative object motion, small objects, and class imbalance. In the following, we mainly review the representative deep learning-based UAV object detection methods [61], [62], [63], [64] with one-stage and two-stage detectors.
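Both detector families output scored candidate boxes, and overlapping candidates are commonly merged by non-maximum suppression (NMS) based on intersection over union (IoU). The following is a minimal pure-Python sketch of that common post-processing step, not the implementation of any detector cited above. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For densely packed UAV targets, the threshold is a sensitive choice: a low value may suppress genuinely adjacent objects, which is one reason dense scenes are harder for standard detectors.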
One-stage Detector: To mitigate the real-time scene parsing challenges, Zhang et al. [65] developed the SlimYOLOv3 model, which learns efficient deep object detectors via channel pruning of convolutional layers. For the problem of small object detection, Liang et al. [66] proposed a feature fusion and scaling-based single shot detector (FS-SSD), which incorporates the spatial object relationships into object redetection. To tackle the small objects in UAV images, Liu et al. [67] proposed a multiscale feature fusion algorithm, termed dilated-attention-feature fusion SSD (D-A-FS SSD), which combines dilated convolution with an attention mechanism. Liu et al. [68] developed UAV-YOLO for detecting small objects in UAV images by enlarging the receptive field. To tackle the challenges of large scale change and real-time processing, Li et al. [69] proposed the DSYolov3 model by adding multiple scale-aware decision discrimination networks, which involves a channel attention model and sparsity-based channel pruning based on the YOLOv3 model.
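Channel pruning, as mentioned for SlimYOLOv3 and DSYolov3, generally ranks channels by an importance score and removes the least important ones. The sketch below illustrates only the selection step under stated assumptions: each channel has a scalar score (for example, the magnitude of a learned per-channel scale factor), a single global threshold is used, and each layer keeps at least one channel. The function name `select_channels` and these choices are hypothetical and are not taken from the cited papers.

```python
def select_channels(scale_factors, prune_ratio):
    """Return, per layer, the indices of channels to keep.

    scale_factors: list of per-layer lists of channel importance scores.
    prune_ratio: fraction of all channels to remove, in [0, 1).
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    # A single global threshold is taken from the sorted score magnitudes.
    all_scores = sorted(abs(s) for layer in scale_factors for s in layer)
    n_prune = int(len(all_scores) * prune_ratio)
    threshold = all_scores[n_prune - 1] if n_prune > 0 else float("-inf")
    keep = []
    for layer in scale_factors:
        # Channels tied with the threshold are pruned as well.
        kept = [i for i, s in enumerate(layer) if abs(s) > threshold]
        if not kept:
            # Assumption: never remove a whole layer; keep its strongest channel.
            kept = [max(range(len(layer)), key=lambda i: abs(layer[i]))]
        keep.append(kept)
    return keep
```

In practice the pruned network is usually fine-tuned afterwards to recover accuracy; that step is outside this sketch.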
Two-/Multistage Detector: To increase the resolution of objects in UAV images, Soleimani et al. [70] proposed a "yes or no" question answering framework with two steps for finding particular individuals conducting one or several actions within aerial pictures. For the detection of multioriented vehicles within aerial images and videos, Li et al. [71] developed a rotatable region-based residual network (R3-Net). To tackle the small-sized pedestrian problem, Xie et al. [72] proposed a context-aware pedestrian detection approach, i.e., deconvolution integrated faster R-CNN (DIF R-CNN), which integrates a deconvolutional module into faster R-CNN to acquire additional context information. Yang et al. [73] developed a clustered detection (ClusDet) network to unify object clustering and detection within an end-to-end framework, covering a dedicated detection network (DetecNet), a scale estimation subnetwork (ScaleNet), and a cluster proposal subnetwork (CPNet). To address large variation in object size and nonuniform object distribution, Li et al. [74] proposed a density-map-guided object detection network (DMNet), which involves a density map generation module, an image cropping module, and an object detector. Liu et al. [75] proposed a high-resolution detection network (HRDNet) that takes multiple resolutions as inputs to backbones of multiple depths. HRDNet involves a multiscale feature pyramid network (MS-FPN) and a multidepth image pyramid network (MD-IPN) to improve the detection of small objects while maintaining the performance on large-scale and middle-scale objects. Wu et al. [76] developed an approach dubbed nuisance disentangled feature transform (NDFT), which utilizes freely available metadata accompanying UAV images to learn domain-robust features via an adversarial training framework.
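The density-map-guided pipeline described for DMNet (density map, image cropping, detection on the crops) can be illustrated with a small sketch of the cropping idea alone. It assumes that the density map is already available as a 2-D grid, that dense cells are those above a fixed threshold, and that crops are the bounding rectangles of 4-connected dense regions. The function name `density_crops` and these choices are illustrative assumptions rather than the procedure of [74].

```python
from collections import deque


def density_crops(density, threshold):
    """Return inclusive boxes (row0, col0, row1, col1) around each
    4-connected region of cells whose density exceeds `threshold`."""
    if not density:
        return []
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    crops = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or density[r][c] <= threshold:
                continue
            # Breadth-first search over one connected dense region.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and density[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            crops.append((r0, c0, r1, c1))
    return crops
```

Each returned box would then be scaled back to image coordinates, cropped, and passed to the detector together with the full image, so that small objects in crowded regions are processed at a higher effective resolution.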

B. Object Tracking
Object tracking methods fall into two types: 1) generative tracking and 2) discriminative tracking. Generative tracking methods, such as Meanshift, Camshift, the optical flow method, and the particle filter, build a target model to extract target features and search for similar features within subsequent frames. Discriminative tracking methods consider both the target model and background information in the training process [77], [78], and locate the target in the current frame by distinguishing the target model from the background information. Discriminative models primarily follow two directions: discriminative correlation filter (DCF)-based methods, including MOSSE [79], CSK [80], KCF [81], and SAMF [82], and DL-based methods, such as MDNet [83], TCNN [84], and Siamese networks [85]. Next, we review the representative discriminative tracker models for UAV object tracking [86], [87].
DCF-Based Tracker: Huang et al. [88] developed an aberrance repressed correlation filter (ARCF) to repress aberrances in UAV object tracking. By restricting the alteration rate of the response maps generated in the detection phase, the ARCF tracker is capable of suppressing aberrances and tracking objects robustly and accurately. Ye et al. [89] developed a multiregularized correlation filter (MRCF) through the regularization of the reliability of channels and the deviation of responses. The MRCF tracker yields adaptive channel weight distributions and smooth response changes simultaneously, which effectively adapts to object appearance changes and enhances discriminability. To tackle internal and external interference, Han proposed a state-aware anti-drift tracker (SAT) [90] that jointly learns the features of the target and its surrounding patches. Afterward, Han et al. and Yuan et al. [91], [92] proposed several spatial-temporal context-aware tracking algorithms based on DCF. Specifically, these models learn a spatial-temporal context weight so that the target and background can be precisely distinguished under UAV-tracking conditions. Furthermore, considering the aerial view and the small object scale in UAV-tracking scenarios, both of these DCF-based trackers incorporate spatial context information to reduce background interference. Li et al. [93] used the local variation of the response map as spatial regularization, which allows spatio-temporal regularization terms to be learned online, adaptively and automatically. Targeting UAV tracking at night, Ye et al. [42] proposed a spatial-channel transformer-based low-light enhancer (SCT), which is trained in a task-inspired manner. Specifically, they developed a novel spatial-channel attention module for modeling global information while retaining local context. During the enhancement process, SCT simultaneously denoises and illuminates nighttime images based on a nonlinear curve projection. Fu et al. [94] proposed a novel tracker learned by dynamic regression with automatic distractor repression (DR-Track), where the regression label is controlled dynamically to repress distractors, which are indicated by local maxima. Yang et al. [95] transformed the large-scale least-squares problem in the spatial domain into several small-scale constrained problems in the Fourier domain, using the correlation filter method to meet the real-time requirement of UAV tracking.
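The efficiency of DCF-based trackers comes from solving a ridge regression per frequency in the Fourier domain. As a minimal illustration, the sketch below trains a single-channel, one-dimensional filter in the style of MOSSE [79], using the closed form H* = (G ⊙ F*) / (F ⊙ F* + λ), and locates a circularly shifted target from the response peak. It is not an implementation of any tracker reviewed above. The naive DFT, the Gaussian label, the value of λ, and all function names are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for tiny signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gaussian_label(n, center, sigma=1.0):
    """Desired response: a circular Gaussian peaked at the target position."""
    label = []
    for t in range(n):
        d = abs(t - center)
        d = min(d, n - d)  # wrap-around distance
        label.append(math.exp(-0.5 * (d / sigma) ** 2))
    return label


def train_filter(sample, label, lam=1e-3):
    """Closed-form filter per frequency: H* = G conj(F) / (F conj(F) + lam)."""
    F, G = dft(sample), dft(label)
    return [g * f.conjugate() / (f * f.conjugate() + lam)
            for f, g in zip(F, G)]


def respond(filter_conj, patch):
    """Correlation response of a new patch, computed in the Fourier domain."""
    Z = dft(patch)
    response = dft([z * h for z, h in zip(Z, filter_conj)], inverse=True)
    return [v.real for v in response]


def peak(response):
    """Index of the maximum response, i.e., the estimated target position."""
    return max(range(len(response)), key=lambda i: response[i])
```

For example, training on a 16-sample signal whose target is centered at index 3 and then evaluating the same signal circularly shifted by four samples moves the response peak from index 3 to index 7. The regularized trackers reviewed above add further terms to this objective (e.g., on response changes or channel reliability), which this sketch does not model.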
DL-Based Tracker: The emergence of deep learning has brought a significant leap forward in the visual object tracking field, especially for outdoor scenes. Zhang et al. [96] proposed a coarse-to-fine deep scheme for tackling the aspect ratio change problem in UAV tracking. First, the coarse tracker generates an initial estimate of the target object, and then a sequence of actions is learned to fine-tune the four boundaries of the bounding box. Jiang et al. [39] proposed a dual-flow semantic consistency (DFSC) method for UAV tracking. Under the modulation of the semantic flow across video sequences, the tracker can learn more robust class-level semantic information and obtain more discriminative instance-level features. To tackle the multi-object tracking problem in UAV videos, Yu et al. [97] proposed a Siamese network to estimate global motion information in UAV video, which leverages conditional generative adversarial networks (GANs) to produce the final motion prediction. Han et al. [12] combined an efficient DCF-based tracker with a precise DL model to eliminate the accumulated drift in vehicle tracking. To be specific, the prediction of the DCF tracker is fed into a boundary regression network, which is designed to correct the target's boundary, aiming at long-term tracking. Siamese models were first employed for handwritten signature verification [98], [99] and were gradually extended to the object tracking task. Thanks to the powerful feature representation capability of CNNs, Siamese models show great potential, as concluded in relevant surveys [100], [101], [102], [103], [104]. Although DL-based trackers can achieve higher performance, they remain difficult to deploy because the limited size and computation resources of the UAV platform rule out powerful GPUs. Moreover, the inference time of DL-based algorithms is relatively long, which may not meet the real-time requirement of aerial tracking. Fu et al. [23] reviewed the research progress of Siamese trackers [105] and the development of high-performance embedded devices [106], pointing out a potential direction for Siamese UAV tracking.
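To make the matching step of a Siamese tracker concrete, the sketch below applies one shared embedding to both the template and the search region and then cross-correlates the two feature maps, as is commonly done in fully convolutional Siamese trackers. It is an illustration only and does not reproduce any tracker cited above. In particular, `embed` is a hypothetical stand-in for a learned CNN backbone (here it only normalizes each channel), and all function names are assumptions.

```python
import numpy as np

def embed(image):
    """Hypothetical stand-in for the shared CNN backbone.

    Normalizes each channel of a (C, H, W) array to zero mean and unit variance.
    """
    mean = image.mean(axis=(1, 2), keepdims=True)
    std = image.std(axis=(1, 2), keepdims=True) + 1e-8
    return (image - mean) / std

def cross_correlation(search_feat, template_feat):
    """Slide the template over the search features, summing products over channels."""
    c, H, W = search_feat.shape
    c_t, h, w = template_feat.shape
    assert c == c_t, "both branches must produce the same number of channels"
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            response[i, j] = np.sum(window * template_feat)
    return response

def siamese_localize(search, template):
    """Return the response map and the top-left offset of the best match."""
    response = cross_correlation(embed(search), embed(template))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, tuple(int(p) for p in peak)
```

The explicit double loop is chosen for readability; on an embedded UAV platform the same operation would normally be expressed as a convolution so that it can run on optimized hardware kernels.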

C. Semantic Segmentation
Semantic segmentation aims to assign a label or category to each pixel in an image and to identify the collections of pixels that constitute different types [107], [108], [109]. There are two main lines of semantic segmentation research: 1) probabilistic graph models, such as [45], and 2) DL-based methods [110], which have emerged over the past few years. A probabilistic graph model, such as a Markov random field (MRF) or a conditional random field (CRF), uses a graph to express the conditional dependence structure between random variables. It models the joint probability distribution of the related image entities to perform semantic segmentation. At the same time, the rapid development of deep learning in computer vision provides a basis for its application to remote sensing imagery [111]. The progress of convolutional neural networks in pixel-by-pixel image classification builds on massive datasets of daily scenes, such as Pascal VOC [112] and MS-COCO. Remote sensing images differ from daily-scene images in their high spatial resolution, complex scenes, and numerous targets. Accordingly, more and more research has focused on applying CNNs to various remote sensing tasks. In the following, we mainly review the semantic segmentation of UAV images [86], [87] with probabilistic graph models and DL-based methods.
Probabilistic Model: Yao et al. [113] constructed a triple-multipyramid structure, which combines multiresolution, multiregion adjacency graph (RAG), and multisemantic elements. Kong et al. [114] exploited the geographical information of the region of interest, in the form of a digital surface model (DSM), for the semantic segmentation of urban UAV images. Their method combines visual features, DSM information, and a multiscale strategy with attention to improve the segmentation results.
DL-based Model: Sherrah et al. [110] proposed a deep fully convolutional network (FCN) without downsampling to obviate the need for deconvolution or interpolation. To exploit image features more effectively, they fine-tune the pretrained CNN on remote sensing data with a hybrid network. Kampffmeyer et al. [115] targeted the class imbalance problem for small objects. They use recent advances in uncertainty measurement for CNNs and assess their qualitative and quantitative quality in a remote sensing context. Specifically, they adopt different deep architectures to cover the patch-based and so-called pixel-to-pixel methods, as well as their integration, for semantic segmentation. Maggiori et al. [116] derived a CNN framework adapted to the semantic segmentation problem, which learns features at different resolutions and also learns how to combine them. Girisha et al. [117] created a novel, manually annotated semantic segmentation dataset. Moreover, they explore the performance of semantic segmentation algorithms for aerial videos with the FCN and U-net architectures. Girisha et al. [118] proposed an enhanced encoder-decoder-based CNN architecture (UVid-Net) for UAV video semantic segmentation. The encoder embeds temporal information to obtain temporally consistent labeling. The decoder introduces a feature-refiner module to improve the localization of the class labels.
Besides the semantic segmentation of images, video segmentation aims to group pixels with consistent appearance and motion across video frames into continuous spatio-temporal regions. Video segmentation can be brought into remote sensing applications as a preprocessing module for further high-level applications. However, research on remote sensing video segmentation is extremely rare. Cheng et al. [119] developed a video segmentation algorithm for aerial surveillance video based on a mixture of experts. They employ three experts: a trainable sequential maximum a posteriori algorithm for supervised image segmentation, the mean-shift algorithm for unsupervised image segmentation, and a moving object detection algorithm. With the domain knowledge of aerial video surveillance, the outputs of these three experts can be effectively combined to generate the final segmentation result. Teutsch [120] assessed various object segmentation methods based on machine learning, blob extraction, and contour extraction, and proposed a local sliding window method with an AdaBoost classifier and integrated channel features. Wang proposed the S-MRF approach [45], which is a principled combination of superpixel labeling priors and the Markov random field for UAV semantic segmentation. Specifically, S-MRF utilizes the UAV metadata for motion estimation, followed by the superpixel labeling prior and MRF optimization.

IV. EVALUATION METRICS
Evaluation metrics are needed to quantitatively demonstrate the effectiveness of object detection, object tracking, and semantic segmentation methods. In the following, we introduce some of the most commonly used evaluation metrics for these tasks.

A. Object Detection
We can measure the object detection methods from three aspects: 1) localization accuracy, 2) classification accuracy, and 3) efficiency.
Localization Accuracy: Intersection over union (IoU) is the most commonly used metric. It calculates the ratio of the intersection to the union of the true and predicted sets, generally represented as
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{1}$$
where $TP$, $FP$, $TN$, and $FN$ denote true positive, false positive, true negative, and false negative, respectively.
Classification Accuracy: Commonly used metrics include Accuracy, Confusion Matrix, Precision, Recall, and AP.
Accuracy is defined as the number of correctly predicted samples divided by the total number of samples
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{2}$$
Precision is defined as the ratio of true positive samples among the samples predicted as positive
$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{3}$$
Recall always accompanies Precision and calculates the ratio between the correctly predicted positive samples and the total positive samples
$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{4}$$
Usually, the Precision-Recall curve is used in the object detection task to show the tradeoff between the precision and recall of the classification.
Average precision (AP) and mean average precision (mAP) are two other important metrics for object detection algorithms. AP is the area under the Precision-Recall curve, and mAP is the average of the APs over multiple classes. For both AP and mAP, the higher, the better. F1-score is the harmonic mean of precision and recall, which is calculated as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{5}$$
The receiver operating characteristic (ROC) curve is another commonly used metric, in which the x-axis and y-axis represent the false positive rate (FPR) and the true positive rate (TPR), respectively. The larger the TPR and the smaller the FPR, the better the classification result.
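As a rough illustration of the metrics above, the following sketch computes the IoU of two boxes, precision, recall, and F1 from raw counts, and AP as the area under the Precision-Recall curve. Several details are assumptions rather than part of the definitions in this section: boxes are given as `(x1, y1, x2, y2)`, each detection has already been labeled as a true or false positive (for example, by thresholding its IoU with the ground truth), and AP uses all-point interpolation, which is only one of several conventions in use.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep detections from most to least confident
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP would then be the mean of `average_precision` over the object classes of a dataset.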
Efficiency: Frames per second (FPS) is commonly used to measure how many images are processed per second; the larger, the faster. Some works also measure memory usage at run time.
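A simple way to estimate FPS is to time a processing function over a batch of frames. The sketch below is one possible measurement procedure, not a standard benchmark protocol; `process_frame` is a hypothetical callable standing in for a detector, and the warm-up count is an assumption.

```python
import time

def measure_fps(process_frame, frames, warmup=2):
    """Estimate frames per second of `process_frame` over `frames`."""
    # Warm-up calls are excluded so one-time setup costs do not bias the result.
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

Reported FPS depends strongly on the hardware and the input resolution, so both should be stated alongside the number.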

B. Object Tracking
Generally, researchers employ the one-pass evaluation (OPE) methodology [121], [122] to validate the accuracy and robustness of SOT algorithms. Each compared tracker is initialized with the target's state (location and scale) given in the first frame of the video. Afterward, the tracking result is recorded for each subsequent frame, regardless of whether the tracker is still on the target. Under the OPE protocol, two metrics (precision rate and success rate) are used to evaluate the performance of the compared methods.
Precision Rate: The precision rate is the percentage of frames whose center location error (CLE) is within a given threshold, where the CLE is the Euclidean distance between the center predicted by the candidate tracker $C_{pr}$ and the center of the annotated bounding box $C_{bb}$
$$\mathrm{CLE} = \lVert C_{pr} - C_{bb} \rVert_2. \tag{6}$$
Generally, 20 pixels is set as the threshold to determine whether the tracker has drifted in each frame. However, for some specific scenarios, the threshold may be adjusted according to the target size. Since the precision rate varies across different videos, researchers generally average the precision score over all sequences to obtain a comprehensive evaluation of a tracking algorithm on a certain dataset. It should also be noted that the precision metric is easily affected by the image resolution and the bounding-box scale. Therefore, a normalized precision metric is also employed in some literature, in which the center location error is normalized by the scale of the bounding box.
Success Rate: The success rate is based upon the overlap ratio $OP$, which is defined as the intersection over union of the area of the annotated bounding box $A_{bb}$ and the area predicted by the candidate tracker $A_{pr}$
$$OP = \frac{|A_{bb} \cap A_{pr}|}{|A_{bb} \cup A_{pr}|}. \tag{7}$$
The success plot shows the percentage of frames in which the overlap ratio is larger than a predefined threshold. Linking the success rates under different thresholds yields a continuous curve, and the area under the curve (AUC) serves as the second metric to rank the trackers.
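The two OPE metrics can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are stored as rows of `(x, y, w, h)` with positive width and height, the success thresholds are sampled uniformly on [0, 1], and the AUC is approximated by the mean success rate over those thresholds. The function names and the number of thresholds are illustrative choices.

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between predicted and annotated box centers."""
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

def overlap_ratios(pred, gt):
    """Per-frame intersection over union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_rate(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    return float(np.mean(center_errors(pred, gt) <= threshold))

def success_auc(pred, gt, num_thresholds=21):
    """Mean success rate over uniformly spaced overlap thresholds in [0, 1]."""
    iou = overlap_ratios(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [np.mean(iou > t) for t in thresholds]
    return float(np.mean(success))
```

For a dataset-level score, these per-sequence values would be averaged over all sequences, as described above.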

C. Semantic Segmentation
Pixel Accuracy (PA): PA represents the ratio of correctly predicted pixels over all classes to the total number of pixels
$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \tag{8}$$
where $(k + 1)$ is the total number of categories, with $k$ foreground categories and one background category, and $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$. When $i = j$, the prediction is correct; otherwise, the prediction is wrong.
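PA and the other metrics of this subsection (MPA, MIoU, and FWIoU) can all be derived from the counts $p_{ij}$ arranged as a confusion matrix. The sketch below is one way to do so; it assumes integer label maps with values in `0..num_classes-1` and that every class occurs at least once in the ground truth, so that no division by zero arises. The function names are illustrative.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Entry [i, j] counts the pixels of class i that are predicted as class j."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    index = gt * num_classes + pred
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(p):
    """Return (PA, MPA, MIoU, FWIoU) computed from a confusion matrix p."""
    p = p.astype(float)
    correct = np.diag(p)           # p_ii
    gt_total = p.sum(axis=1)       # sum over j of p_ij: pixels of class i
    pred_total = p.sum(axis=0)     # sum over j of p_ji: pixels predicted as i
    iou = correct / (gt_total + pred_total - correct)
    pa = correct.sum() / p.sum()
    mpa = np.mean(correct / gt_total)
    miou = np.mean(iou)
    fwiou = np.sum(gt_total / p.sum() * iou)  # weights: class pixel frequency
    return pa, mpa, miou, fwiou
```

Accumulating one confusion matrix over the whole test set before computing the metrics avoids averaging per-image scores, which would weight small and large images equally.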
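Mean Pixel Accuracy (MPA): Different from PA, MPA first calculates, for each category, the ratio of correctly predicted pixels to the total number of pixels in that category and then averages the results over all categories
$$\mathrm{MPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}. \tag{9}$$
Mean Intersection over Union (MIoU): In semantic segmentation, MIoU can be represented as the mean of the IoU over all categories
$$\mathrm{MIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}. \tag{10}$$
Frequency Weighted Intersection over Union (FWIoU): FWIoU is an improved version of MIoU. The difference between FWIoU and MIoU lies in the weighting. MIoU applies the same weight $\frac{1}{k+1}$ to each category, while FWIoU weights each category by the ratio of the number of pixels in that category to the total number of pixels
$$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}. \tag{11}$$

V. UPCOMING DOMAINS AND FUTURE OPPORTUNITIES
This work reviews several popular UAV task-related datasets and methods for Earth observation. This section summarizes some potential future directions for the UAV vision area.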
First, recent work lacks systematic validation. Most UAV vision studies rely on only one or a few datasets to validate the performance of their methods, rather than evaluating them on extensive datasets that cover the various characteristics of UAVs. Therefore, establishing a benchmark for evaluating different methods on extensive datasets covering these characteristics is a very useful direction for the future.
Second, real-time performance is a significant problem in the UAV vision area. In recent years, deep learning has become a popular method for dealing with visual tasks because of its powerful recognition ability. However, it usually requires a lot of computing resources, whereas small UAVs cannot carry large devices such as GPUs. Therefore, how to perform real-time visual tasks on small devices is an urgent problem to be solved. Current research focuses on how to perform vision tasks on the data captured by UAVs, and few works consider the real-time problem.
Third, technological advances in the UAV visual field are extending our capability at a breakneck speed, enabling many data modalities other than individual visible-light images to be captured. For instance, recent developments in imaging provide the opportunity to analyze infrared or other data modalities from different sensor devices, holding great promise to further transform the UAV visual field. Given the different modalities of such data (e.g., visible light, infrared), we can merge them before applying them to our tasks, and we anticipate that UAV visual techniques will be readily adopted for these data types as they become more available.

VI. CONCLUSION
The explosion of UAVs over the past few years has resulted in a resurgence in designing and employing the corresponding vision techniques for analyzing UAV data. In this study, we revisited and summarized the datasets and methods for UAV object detection, object tracking, and semantic segmentation from the last decade. Subsequently, we reviewed the recent literature on their applications, summarized the achievements, and identified the missing aspects. Finally, we provided several research directions and practical considerations, namely comprehensive validation, the real-time problem, and multimodality information, which we hope will spark future research on applications in the UAV vision era.