DroneSense: The Identification, Segmentation, and Orientation Detection of Drones via Neural Networks

The growing ubiquity of drones has raised concerns over the ability of traditional air-space monitoring technologies to accurately characterise such vehicles. Here, we present a CNN using a decision tree and ensemble structure to fully characterise drones in flight. Our system determines the drone type and orientation (in terms of pitch, roll, and yaw), and performs segmentation to classify different body parts (engines, body, and camera). We also provide a computer model for the rapid generation of large quantities of accurately labelled photo-realistic training data and demonstrate that this data is of sufficient fidelity to allow the system to accurately characterise real drones in flight. Our network will provide a valuable tool in the image processing chain where it may build upon existing drone detection technologies to provide complete drone characterisation over wide areas.


I. INTRODUCTION
The proliferation of semi-autonomous aerial vehicles, i.e. drones, into the consumer and industrial spaces, combined with the growing number of drone related incidents (infractions into commercial airspace [1], [2], or the use of drones by militant groups [3], [4]), has raised concerns over the ability of existing aerial detection systems to accurately characterise such vehicles [5]- [7]. Specifically, many existing air-space monitoring technologies are optimized to detect the presence of a vehicle, identify its type, and track its position over time, but they lack the resolution to determine target specific features. This, in conjunction with drones' ability to decouple their motion in space from their assigned task, e.g. simultaneously translating and rotating to keep a subject in frame whilst filming, means that presence, type and position are often insufficient to accurately identify the intent of a vehicle.
To accurately assess the intent of a drone it is necessary to fully characterize its 'pose' i.e., not only identify its type but also segment it into functional components and identify the orientation of these components in 3D space. Fig. 1 conceptually illustrates this process showing a DJI Mavic 2 drone segmented into colour-coded components and placed within a 3D gimbal corresponding to its orientation.
The associate editor coordinating the review of this manuscript and approving it for publication was Felix Albu.
To address the problem of drone characterization a wide variety of machine learning assisted drone detection systems have been developed. For example, radio based methods eavesdrop on the communications between drones and pilots and apply statistical analyses of control signals [8]- [11], Convolutional Neural Networks (CNNs) analysing the spectrogram [12]- [15], K-Nearest Neighbours (KNN) clustering of signals [16], cyclostationary feature extractors [17], decision trees [18] and random forest techniques [19], bit-analysis [20], and residual [21], recurrent [22] and hierarchical networks [23]. Additionally, acoustic based methods analysing the noise of a drone's motors and propellers have also been developed using Mel Frequency Cepstral Coefficients (MFCC) [24]- [29] or by converting the signal to a spectrogram [27], [30], [31]. Once obtained, the MFCC or spectrogram feature set can be used to train Long Short-Term Memory (LSTM) models [24], or convolution type models such as CNNs [31]- [36], Recurrent
FIGURE 1. A conceptual representation of drone pose. A drone (here represented by a DJI Mavic 2) is identified and divided into its components, for instance, body (blue), engines (red), and camera (green). Further, the orientation of the drone in 3D space, as represented by the roll (magenta), pitch (orange) and yaw (cyan) gimbals, is identified.
Despite the relative efficacy of acoustic and radio based systems, the introduction of quiet micro-drones and fully autonomous drones (which do not require radio commands) has rendered them progressively less versatile and has necessitated the development of radar and optical based sensor systems. Radar in particular has seen extensive development including pulsed systems [40]- [42], Doppler systems [43]- [47], and Frequency Modulated Continuous Wave (FMCW) systems [48]- [50], all at multiple wavelengths [51]- [60]. The reader is directed to Refs [61]- [64] for a comprehensive review. Whilst radar based systems are able to monitor a large area and are robust to atmospheric conditions, their reliance on micro-Doppler information for drone type identification and their poor transverse resolution have prevented their application to problems beyond target detection and tracking. Hence, in parallel to radar systems, machine learning assisted optical drone detection systems have been developed. Such systems have been extensively used to identify the presence of drones in an image and construct bounding boxes at ranges comparable to that of radar systems [65], [66].
The most common approach to optical drone detection is to train existing CNN based networks such as You Only Look Once (YOLO) [67], [68] and ResNet [69], [70] on colour camera images. Work in this area includes coupling such networks to pan-tilt and zoom camera mounts to track moving objects [71], using multi-camera systems to increase the field of view [72], [73], utilising the high speed nature of YOLO to identify drones at video frame rates [74], comparing the performance of YOLO v2 and YOLO v3 on drones at short range against static backgrounds [75], examining the effect of incorrect image labels on YOLO [76], and modified YOLO implementations [77].
More complex optical CNN architectures have also been developed where features in the image (such as moving objects) are enhanced before being sent to a second network for identification. These multi-stage networks have proven to generally be more effective at discriminating drones from drone-like objects in images such as birds [78]- [81]. Such networks have been developed using background subtraction with image stabilization [82] and CNNs [83], [84], subtracting successive frames and clustering using an SVM [85], HAAR filters [86] for edge and feature detection, foreground background separation [87], ResNet for feature extraction and SVMs for classification [88], Kalman filters and ResNet [89], Faster-RCNN and ResNet [90], and, using trajectory mapping to suppress erroneous YOLO identifications [91]. Additionally, several other networks have been used for drone identification. These include identifying regions of interest in an image [92] using Histogram of Gradient (HOG) descriptors with thresholding or Fourier descriptors [93], simultaneous image upsampling and downsampling [94], Inception Net v3 [95], generic Fourier descriptors [96], [97], Faster-RCNN [98], and, TIB-Net with CenterNet, lightweight networks optimised for speed of processing [99], [100].
Finally, a number of more niche applications have also been investigated, such as controlling the flight of a drone based on external camera observations [101] and using cameras mounted on multiple drones to track and even intercept hostile drones [102]- [105]. For a review of the different machine learning implementations listed above the reader is directed to Refs [106]- [108]. Despite the numerous optical systems developed to date, characterisation of drones beyond presence, location and type remains rare, with demonstrations limited to determining if a drone is carrying a payload [109] or the identification of key points on a single drone at short range [110].
A promising avenue for the more complete characterization of drones is given by sensor fusion in which multiple sensors are combined. For example, a large field-of-view low resolution sensor can be used to direct a small field-of-view high resolution sensor, with such systems seeing improvements in performance of up to 15% [72], [111], [112]. In the case of optical drone detection systems one such example is the development of depth sensing time-of-flight systems such as LIDARs. The active illumination of LIDARs allows them to operate when no passive light source is available (such as at night), detect targets which themselves emit no thermal radiation, and operate to a limited degree through obscurants. Scanning LIDARs have been shown to be effective at drone detection at ranges up to 2 km when coupled with a Variable Radially Bounded Nearest Neighbour (V-RBNN) network to analyse the point cloud [113], [114]. Further, flash LIDAR systems such as those employing Single Photon Avalanche Detector (SPAD) array cameras allow for the simultaneous capture of a 'traditional' high transverse resolution intensity image as well as a lower transverse resolution depth image (where 'depth' refers to the distance between the camera and the object for each pixel). Such systems have been shown to be effective at identifying the pose of objects at short range [115], [116] but have yet to be applied to the problem of drone characterization.
Here, we present a CNN which provides the complete characterization of drones. The network takes as input an intensity image as well as depth data and outputs: the identity of the drone, i.e. the type of drone in the data; the segmentation of the drone, in which each pixel in the intensity image is classified according to the drone component it represents; and the orientation, i.e. the angle of the drone about its three principal axes of rotation: yaw, pitch, and roll. We examine the performance of the network in multiple scenarios including different drones, different ranges of motion and different data inputs. We assume that our system is being used in an image processing chain where supplementary systems such as radars would have already distinguished the drone from drone-like objects (e.g. birds) and would be able to direct a small field-of-view camera at the drone. We outline a system for producing large quantities of accurately labelled simulated data on which we train our network. To verify both our network structure and our simulated training data we demonstrate the ability of our network to accurately characterize an image of a real DJI Mavic 2 Zoom drone in flight as captured by a Quantic 4 × 4 SPAD camera [117]. The SPAD camera represents a state-of-the-art sensor fusion system combining a functional transverse resolution of 80 × 240 pixels for intensity and 20 × 60 pixels for depth. Further, each depth pixel outputs a depth histogram with 500 picosecond temporal resolution. Finally, the architecture of the chip has the potential for the alternating acquisition of visible spectrum intensity images and depth histograms at rates in excess of 1000 frames per second [115].
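As a worked illustration of what a 500 picosecond temporal resolution implies for depth, the sketch below converts a timing bin into a depth interval using the standard round-trip time-of-flight relation, depth = c·t/2. The function name is illustrative and the calculation assumes light travels to the target and back at the vacuum speed of light.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre

def depth_bin_width_m(time_bin_s):
    """Depth interval spanned by one timing bin, assuming round-trip time-of-flight."""
    return SPEED_OF_LIGHT_M_PER_S * time_bin_s / 2.0

# 500 ps temporal resolution, as quoted for the SPAD depth histograms
bin_width = depth_bin_width_m(500e-12)
print(f"{bin_width * 100:.2f} cm per histogram bin")
```

Under these assumptions each timing bin spans roughly 7.5 cm of depth.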

II. NETWORK ARCHITECTURE
We present a network architecture built on a decision tree coupled with an ensemble network. The decision tree identifies the type of drone after which a set of drone-specific pretrained networks are applied in parallel to perform the orientation and segmentation operations. Specifically, the orientation is determined by three identical networks each trained to identify a single axis (roll, pitch or yaw) while the segmentation is performed by an additional U-Net [118] type network. This structure allows multiple drone parameters to be identified simultaneously through network parallelization whilst allowing each network to be optimized on a specific parameter, yielding superior overall performance.
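The control flow described above (identify the type, then hand the same inputs to the networks trained for that type) can be sketched as follows. This is a minimal illustration only: `DroneNetworks`, `characterise` and the constant-valued stand-in networks are hypothetical names introduced here, and the per-type networks are called one after another rather than in parallel for simplicity.

```python
from typing import Callable, Dict, NamedTuple

class DroneNetworks(NamedTuple):
    """Hypothetical container for the networks trained for one drone type."""
    roll: Callable
    pitch: Callable
    yaw: Callable
    segment: Callable

def characterise(intensity, histogram, identify: Callable,
                 networks_by_type: Dict[int, DroneNetworks]) -> dict:
    """Decision tree step followed by the type-specific ensemble."""
    drone_type = identify(intensity, histogram)   # decision tree: which drone?
    nets = networks_by_type[drone_type]           # pick that drone's networks
    return {
        "type": drone_type,
        "roll": nets.roll(intensity, histogram),
        "pitch": nets.pitch(intensity, histogram),
        "yaw": nets.yaw(intensity, histogram),
        "segmentation": nets.segment(intensity, histogram),
    }

def constant(value):
    """Stand-in for a trained network: ignores its inputs and returns a fixed value."""
    return lambda intensity, histogram: value

# Two drone types, each with its own (placeholder) orientation and segmentation networks
networks_by_type = {
    0: DroneNetworks(constant(180.0), constant(90.0), constant(180.0), constant("mask-A")),
    1: DroneNetworks(constant(170.0), constant(85.0), constant(10.0), constant("mask-B")),
}

result = characterise(None, None, constant(1), networks_by_type)
print(result)
```

Because the orientation and segmentation networks depend only on the shared inputs and the identified type, they are independent of one another and could be evaluated concurrently.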
The lack of high quality drone image training datasets remains an obstacle for machine learning assisted drone classification. To address this, several publications have examined data augmentation [119] techniques such as super-imposing drone images onto unrelated backgrounds [120], super-resolution upscaling [121], and generating new images from Generative Adversarial Networks (GANs) [122]. Here, we leverage the capability of the Unreal Engine video game development environment to rapidly produce a large set of photo-realistic, accurately labelled training data as illustrated by Fig. 2. This approach allows us to explore the parameter space of drone types, orientation limits (e.g. the upside down Inspire 2 in Fig. 2), lighting conditions and image qualities to an extent which would be impractical experimentally. Further, our model could be readily extended to include numerous different backgrounds and weather conditions. The Unreal code is publicly available and can be found at https://github.com/HWQuantum/DroneSense.
FIGURE 3. Processed images from Unreal Engine compared to Quantic 4 × 4 SPAD camera images. a) The intensity and depth images produced by the Unreal environment. b) The data used to train the network. The intensity image is noised with a Poisson filter while the depth image is down-sampled and converted to a histogram of depths (visualised here as a depth image). c) Images captured by a Quantic 4 × 4 SPAD camera of a real drone in flight. Note that the intensity images have been enhanced in contrast for better visualization.
Fig. 3 shows examples of the images processed by the network. The simulated images produced by the Unreal environment (Fig. 3a)) are processed before being passed to the network. The simulated intensity image is noised with a Poisson filter (Fig. 3b)) and resized to 80 × 240 pixels, while the depth is downsampled and converted to a histogram with a dimensionality of 20 × 60 × 15. Fig. 3c) shows the images produced by a Quantic 4 × 4 SPAD array camera of a real drone in flight which the simulated data is designed to mimic. We stress that the image sizes used in the simulated data were selected only to match the physical parameters of the Quantic 4 × 4 SPAD sensor, and the network can be reshaped to any dataset with both intensity and depth information. The images generated by the model could easily be adapted to match those obtained with a different camera. Additionally, the ability of the SPAD to isolate a volume of space using time-of-flight gating means that the background of the images may be neglected. Fig. 4 shows a summary of the identification, orientation and segmentation networks.
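The pre-processing of the simulated images can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the published pipeline: the 160 × 480 render size, the photon level `peak_photons`, the 5 m to 20 m depth window, block averaging as the resize method, and building each histogram from the depths inside one block of render pixels are all choices made here for the example. Only the output sizes (80 × 240 and 20 × 60 × 15) come from the text.

```python
import numpy as np

def block_mean(image, out_rows, out_cols):
    """Resize by averaging non-overlapping blocks (assumed resize method)."""
    rows, cols = image.shape
    br, bc = rows // out_rows, cols // out_cols
    trimmed = image[:out_rows * br, :out_cols * bc]
    return trimmed.reshape(out_rows, br, out_cols, bc).mean(axis=(1, 3))

def add_poisson_noise(intensity, peak_photons, rng):
    """Shot noise: treat intensity in [0, 1] as an expected photon count."""
    counts = rng.poisson(np.clip(intensity, 0.0, 1.0) * peak_photons)
    return counts / peak_photons

def depth_to_histogram(depth, out_rows, out_cols, n_bins, d_min, d_max):
    """Down-sample a depth map into one histogram of depths per output pixel."""
    rows, cols = depth.shape
    br, bc = rows // out_rows, cols // out_cols
    trimmed = depth[:out_rows * br, :out_cols * bc]
    blocks = trimmed.reshape(out_rows, br, out_cols, bc).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(out_rows * out_cols, br * bc)
    edges = np.linspace(d_min, d_max, n_bins + 1)
    bins = np.clip(np.digitize(blocks, edges) - 1, 0, n_bins - 1)
    # Offset each pixel's bin indices so a single bincount fills every histogram
    pixel_offset = np.arange(out_rows * out_cols)[:, None] * n_bins
    hist = np.bincount((pixel_offset + bins).ravel(),
                       minlength=out_rows * out_cols * n_bins)
    return hist.reshape(out_rows, out_cols, n_bins)

rng = np.random.default_rng(0)
render_intensity = rng.random((160, 480))                # stand-in for a rendered image
render_depth = rng.uniform(5.0, 20.0, size=(160, 480))   # stand-in depth map in metres

intensity = block_mean(add_poisson_noise(render_intensity, 50.0, rng), 80, 240)
histogram = depth_to_histogram(render_depth, 20, 60, 15, 5.0, 20.0)
print(intensity.shape, histogram.shape)
```

With these sizes each of the 20 × 60 depth pixels receives a 15-bin histogram built from an 8 × 8 block of rendered depths.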
At the core of these networks is the Drone Feature Encoder (DFE) which reduces the input data to a latent feature space. The DFE takes as input both a histogram of depths (of size $r_h$ rows, $c_h$ columns, and $p_h$ pages) and an intensity image (of size $(r_i, c_i)$). The histogram is passed twice through two 3D convolutional layers (each with 32 filters) and axial max-poolings to extract its depth features and reduce it to a dimensionality of $(r_h, c_h, 1)$. The intensity image is passed through two 2D convolution layers (each with 32 filters) and a max-pooling such that it is reduced to a dimensionality of $(r_h, c_h, 1)$. The intensity and depth tensors are then concatenated and passed twice through a set of two 2D convolutions (each with 32 filters) and max-poolings, ultimately reducing the network inputs to a latent space of 1 × 3 × 32 filters. The DFE is identical in all the networks with each network distinguished by how it handles the data in this latent space.
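To make the dimensional bookkeeping of the DFE concrete, the sketch below tracks only the tensor shapes through the pooling stages for the 20 × 60 × 15 histogram and 80 × 240 intensity inputs. The individual pooling factors are not stated in the text; the values used here (5 then 3 along the histogram axis, 4 × 4 for the intensity branch, 4 × 4 then 5 × 5 for the joint branch) are assumptions chosen so that the stated intermediate and latent sizes are reproduced, and the convolutions are assumed to preserve spatial size.

```python
def pool(shape, factors):
    """Max-pooling divides each dimension by its (integer) pooling factor."""
    return tuple(size // factor for size, factor in zip(shape, factors))

FILTERS = 32  # filters per convolution, as stated in the text

# Depth branch: two passes of [two 3D convolutions + axial max-pooling]
hist_shape = (20, 60, 15)                   # (r_h, c_h, p_h)
hist_shape = pool(hist_shape, (1, 1, 5))    # assumed factor 5 along the histogram axis
hist_shape = pool(hist_shape, (1, 1, 3))    # assumed factor 3 along the histogram axis

# Intensity branch: two 2D convolutions + one max-pooling
image_shape = pool((80, 240), (4, 4))       # assumed 4 x 4 pooling

# Both branches now share the same transverse size and can be concatenated
assert hist_shape[:2] == image_shape

# Joint branch: two passes of [two 2D convolutions + max-pooling]
latent = pool(pool(image_shape, (4, 4)), (5, 5))   # assumed factors 4 then 5
flattened = latent[0] * latent[1] * FILTERS

print(hist_shape, image_shape, latent, flattened)
```

Under these assumed factors the latent space is 1 × 3 with 32 filters, which flattens to 96 elements.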
In the case of the identification network which defines the decision tree, the latent space is flattened to a 96 element vector and connected to a dense layer with 64 neurons. These neurons are in turn connected to a single output node with a Sigmoid activation. This network is trained using cross-entropy as a loss function such that it outputs an integer corresponding to the type of drone in the image. The orientation networks are identical in structure to the identification network, but the final neuron uses a ReLU activation. ReLU activation allows the neuron to output a continuous value corresponding to the angle in a given axis. Further, the orientation networks are trained using the loss function given in Eqn. 1 which allows them to correctly account for the cyclic nature of angle prediction and handle the discontinuity in prediction between 360° and 0°.
where l is the label, p is the network's prediction, and abs is the absolute value function. The segmentation network attaches a U-Net to the DFE. This U-Net up-samples the latent space to a set of segmentation predictions of size $(r_i, c_i, n)$ where n corresponds to the number of components being identified. Each layer of the U-Net mirrors the DFE, undoing the max-pooling and using skip connections to concatenate the tensors. These concatenated tensors are then passed through two 2D convolutions each with 32 filters. The network was trained using binary cross-entropy with the Adam optimizer and a learning rate of 0.001 with no dropout. The final output is a single convolutional layer with (in this case) three filters corresponding to the three components being identified: the body of the drone, the engines of the drone, and the camera on the drone.
FIGURE 4. The networks take in a high transverse resolution intensity image and a low transverse resolution histogram of depth. Using convolution, pooling, and concatenation the inputs are reduced to a dense latent space. The identification network connects this latent space to a dense layer and then to a single Sigmoid activated neuron for drone type classification. By contrast, the three orientation networks use an identical structure but employ ReLU activation in the final neuron to output a continuous value corresponding to the angle in a given axis. Segmentation is performed by up-sampling the latent space to a final convolution with filters corresponding to the components being identified.
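Returning to the orientation loss, the cyclic behaviour it requires can be illustrated with a wrap-around absolute error. The specific form used below, 180 − abs(abs(l − p) mod 360 − 180), is an assumption chosen for illustration and should not be read as a restatement of Eqn. 1; it simply has the property that predictions either side of the 360° to 0° boundary are treated as close together.

```python
def cyclic_abs_error(label_deg, prediction_deg):
    """Shortest angular distance between label and prediction, in degrees (0 to 180)."""
    diff = abs(label_deg - prediction_deg) % 360.0
    return 180.0 - abs(diff - 180.0)

def mean_cyclic_loss(labels_deg, predictions_deg):
    """Average wrap-around error over a batch of angles."""
    errors = [cyclic_abs_error(l, p) for l, p in zip(labels_deg, predictions_deg)]
    return sum(errors) / len(errors)

# A prediction of 1 degree for a label of 359 degrees is only 2 degrees wrong,
# whereas a plain absolute difference would report 358 degrees.
print(cyclic_abs_error(359.0, 1.0))
print(mean_cyclic_loss([359.0, 10.0, 90.0], [1.0, 350.0, 90.0]))
```

A plain absolute or squared error would heavily penalise the first two predictions in the batch above even though both are within 20° of their labels.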

A. RESULTS ON SIMULATED DATA
Two drones were used for testing, a DJI Mavic 2 Zoom and a DJI Inspire 2. High fidelity models of these drones were placed in the Unreal environment and a total of 72 000 simulated SPAD images generated. The images feature the drones at random positions within the field-of-view, at random orientations, and at random distances from the SPAD camera, ensuring sufficient variation in the data. From the training images, 10% were reserved for network testing. We do not make use of any image augmentation, although this could be used to increase the total number of training images. The networks were trained until the loss converged and the networks with the best performance on the testing data were saved. These models were then validated on a separately generated set of 3600 unseen validation images, so that the saved networks could not have been selected for, or overfitted to, the validation data. A summary of the results for the identification, segmentation and orientation networks is presented in Table 1. The final trained parameters of the model are specific to the images that we use for training, and these images are closely matched to those generated by the Quantic 4 × 4 sensor. Images collected with a different sensor could be used with this model; however, optimal performance will be achieved only if the model is retrained with the appropriate images.
To ensure non-negative angular values in all drone orientations a coordinate system was established in which level flight facing away from the camera corresponded to yaw = 180°, roll = 180° and pitch = 90°. Within this coordinate system two angular regimes were examined, the 'full angle' regime and the 'reduced angle' regime. In the full angle regime the drone models had the following range of motion: yaw
By examining the radial distribution of the predictions, the accuracy of the networks in each axis and in each regime can be compared and the following observations made. First, the accuracy of the networks is contingent upon the number of images-per-angle the network is given to train on. In the full angle regime, where the pitch is restricted to half the range of the roll and yaw, the accuracy of the pitch network improves significantly since for the same number of total training images the number of examples-per-degree is doubled to ∼400. This is also why in the reduced angle regime, where the roll is restricted, its accuracy matches that of the pitch while that of the yaw does not, even when the same total number of training images is used. Second, the accuracy of the networks is coupled, i.e. for a reduced range of motion in one axis the accuracy of the remaining axes will increase. While this effect is less pronounced than that of examples-per-degree it can be observed in Table 1 where a 4° increase in yaw accuracy is observed for both drones in the reduced angle regime. This is despite the range of motion in that axis remaining constant. The improvement can be attributed to the reduced variance (roll and pitch range) in the images which the yaw network must learn. Third, the accuracy of the networks is somewhat contingent on the symmetry of the drone. Specifically, the Mavic 2 is nearly perfectly symmetric about its roll axis; consequently the accuracy of the Mavic 2 roll network in the full angle regime is the worst. This is because there are the fewest features to unambiguously identify the roll at angles outside of a 90° to 270° range.
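The examples-per-degree argument can be checked with simple arithmetic. The sketch below assumes that all 72 000 images are counted and that the unrestricted axes span 360° while the restricted axis spans 180°; these ranges are assumptions made for the illustration (they are consistent with the quoted figure of ∼400 examples-per-degree) rather than values taken from the text.

```python
def examples_per_degree(n_images, angular_range_deg):
    """Average number of training images available per degree of an axis."""
    return n_images / angular_range_deg

N_IMAGES = 72_000  # total simulated images quoted in the text

unrestricted = examples_per_degree(N_IMAGES, 360.0)  # assumed 360 degree range
restricted = examples_per_degree(N_IMAGES, 180.0)    # assumed 180 degree range

print(unrestricted, restricted)
```

Halving the angular range of an axis doubles the density of training examples along it without generating any additional images.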
Examining the Intersection over Union (IoU) scores in Fig. 6 it is apparent that the networks can effectively segment both drones into their components regardless of their orientation. The score relating to the 'body' label is the highest in all cases indicating that the network is most accurate at predicting this component. This is likely because it is the most prevalent in terms of pixels in the image. Additionally, the fact that the rows and columns of the IoU scores do not sum to 100 indicates a conservative predictor. This means the network leaves some pixels (particularly around the perimeter of the drone) unclassified, reducing the total accuracy but also minimising misclassification.
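For reference, the per-component IoU and the effect of a conservative predictor can be sketched on a toy label map. The 4 × 4 arrays and the integer encoding (0 for unclassified, 1 for body, 2 for engines, 3 for camera) are illustrative assumptions and are not taken from the data used here.

```python
import numpy as np

COMPONENTS = {1: "body", 2: "engines", 3: "camera"}  # 0 is reserved for unclassified

def per_component_iou(prediction, truth):
    """Intersection over Union for each component label in an integer label map."""
    scores = {}
    for label, name in COMPONENTS.items():
        pred_mask = prediction == label
        true_mask = truth == label
        union = np.logical_or(pred_mask, true_mask).sum()
        intersection = np.logical_and(pred_mask, true_mask).sum()
        scores[name] = float(intersection) / float(union) if union else float("nan")
    return scores

truth = np.array([[0, 1, 1, 0],
                  [2, 1, 1, 2],
                  [0, 1, 3, 0],
                  [0, 0, 0, 0]])

# A conservative prediction: two perimeter pixels are left unclassified (0)
# rather than being assigned to the wrong component.
prediction = np.array([[0, 1, 1, 0],
                       [2, 1, 1, 0],
                       [0, 0, 3, 0],
                       [0, 0, 0, 0]])

scores = per_component_iou(prediction, truth)
print(scores)
```

In this toy case the unclassified perimeter pixels lower the body and engine scores, but no pixel is assigned to the wrong component.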

B. REDUCED INPUT RESULTS
To further examine the functioning of the networks an ablation study was conducted. Specifically, the effect of removing one input channel, either the histograms or the intensity, was quantified. Given that all networks share the DFE it was determined to be sufficient to retrain only the orientation network for the Mavic 2 in the full angle regime since changes in performance in this network would be indicative of changes in all networks. Table 2 presents a summary of the findings with the network predictions visualized in Fig. 7. Table 2 and Fig. 7 indicate that the orientation of a drone can be more accurately determined from a depth input than an intensity input although the relative improvement is small. It should be noted however, that the images on which the network was trained do not contain a background. In real world cases where drones could be optically camouflaged the ability for depth sensing devices to isolate volumes of space ahead of background objects using time-of-flight gating may significantly enhance their robustness in orientation detection.
VOLUME 10, 2022
FIGURE 7. The results of the orientation prediction networks for the Mavic 2 drone in the full angle regime when trained using only an intensity or depth input. The theta coordinate represents the angle with the solid red ring indicating the ground truth. The radial coordinate represents the error (up to a maximum of ±180°). Network under and over predictions fall inside of and outside of the red ring respectively. Predictions made by the networks trained on both inputs are shown as blue triangles. Predictions made by the networks trained on only intensity or depth data are shown as orange stars and yellow squares respectively.
Additionally, given that the segmentation network can only reliably produce images up to the size of its largest input (due to its U-Net structure) there is a benefit to providing the network with a high transverse resolution image. This benefit is illustrated in Fig. 8. Fig. 8 shows the degradation in accuracy of the yaw orientation and segmentation networks when trained on inputs at one-half and one-quarter of the original resolution, as may be the case for a drone which is further away (or smaller) or a lower resolution sensor. As the resolution decreases both the accuracy and the precision (as shown by the increase in standard deviation) of the network decrease. This effect is particularly apparent in the segmentation of small features such as the engines and camera where low resolution images fail to retain the component-specific features on which the network relies for identification. Fig. 8 further illustrates the benefit of sensor fusion approaches which combine depth information with high transverse resolution images.
TABLE 2. Summary of the Mavic 2's orientation network accuracy when trained using only an intensity input or a depth input in the full angle regime.
FIGURE 8. Summary of the Mavic 2's orientation and segmentation network accuracy when trained using inputs at one half and one quarter resolution. The colour panels provide a qualitative illustration of network performance while the numbers report the accuracy and standard deviation as well as the change in those quantities with respect to the full resolution results in Table 1. Generally, as input resolution is reduced network performance worsens, particularly in respect to the segmentation of small components on the drone, such as the engines.

C. RESULTS ON REAL DATA
To demonstrate the real world applicability of our system, we applied the reduced angle network (trained only on simulated data) to an image of a real DJI Mavic 2 Zoom drone captured in flight using a Quantic 4 × 4 SPAD camera. Fig. 9 summarizes the predictions made by the networks and highlights their ability to fully characterise drones in real world conditions. The network correctly identified the drone type and suffered only a small loss in accuracy when performing the segmentation and orientation operations. This reduction in accuracy can be attributed to the reduction in quality between the simulated data and the input data from the Quantic 4 × 4 (as seen in Fig. 3c)).

IV. CONCLUSION
We present a CNN using a decision tree and ensemble structure to fully characterise drones in flight, i.e. to determine their type, orientation and segmentation, with accuracies in excess of 90%. We provide a system for the rapid generation of large quantities of accurately labelled photo-realistic training data and demonstrate that this data is of sufficient fidelity to allow the system to accurately characterise real drones in flight. Our network provides a valuable tool in the image processing chain and can be used in combination with existing drone detection technologies to provide complete drone characterisation over wide areas. Finally, our approach may be readily extended to multiple 3D imaging and sensor fusion systems, enabling pose detection for a wide range of vehicles.
STIRLING SCHOLES received the B.Sc. degree (Hons.) in physics and the M.Sc. degree (Hons.) in optics and photonics from the University of the Witwatersrand, Johannesburg, South Africa, in 2018 and 2020, respectively. He is currently pursuing the Ph.D. degree in applied imaging systems with the Quantum Optics Group, Heriot-Watt University, with a focus on data fusion approaches for high-speed tracking and identification of objects using 3D time-of-flight technology.
ALICE RUGET received the M.Sc. degree in electrical engineering from CentraleSupélec, Gif-sur-Yvette, France, in 2017, and the M.Sc. degree in biomedical engineering from ETH Zürich, Switzerland, in 2019. She is currently pursuing the Ph.D. degree in computational imaging for ultrafast imaging in three dimensions with the HW Quantum Group, Heriot-Watt University, Edinburgh, U.K. Her research interests include the development of machine learning algorithms to enhance the image quality for different imaging systems, such as single-photon sensitive detectors.
He has written over 100 peer-reviewed articles in scientific journals. His research interests include applying classical and quantum optics techniques to solve problems in information and imaging science.