A Stethoscope for Drones: Transformers-Based Methods for UAVs Acoustic Anomaly Detection

Unmanned Aerial Vehicles (UAVs) and the increasing variety of their applications are rising in popularity. The growing number of UAVs emphasizes the significance of drones' reliability and robustness. Thus, there is a need for an efficient self-observing sensing mechanism that detects anomalies in drone behavior in real time. Previous works suggested prediction models from control theory, yet they are complex by nature and hard to implement, while Deep Learning solutions are of great utility. In this paper, we propose a real-time framework to detect anomalies in drones by analyzing the sound they emit. For this purpose, we construct a hybrid of a Deep Learning based Transformer and a Convolutional Neural Network inspired by the well-known VGG architecture. Our approach is examined over a dataset collected in real time from a single microphone mounted on a micro drone. It achieves an F1-score of 88.4% in detecting anomalies and outperforms the VGG-16 architecture. Moreover, the framework presented in this paper reduces the number of parameters of the well-known VGG-16 from 138M to a shrunk version with only 3.6M parameters. Despite the smaller number of parameters in the neural network, our real-time approach yields high accuracy in anomaly detection in drones, with an average inference time of 0.2 seconds per second. Furthermore, with an earphone that weighs less than 100 grams on top of the UAV, our method is shown to be beneficial even in extreme conditions, such as a micro-size dataset composed of three hours of flight recordings. The presented self-observing method can be implemented by simply adding a microphone to the drone and either transmitting the captured audio to the remote control for analysis or performing the analysis onboard the drone using a dedicated microcontroller.


I. INTRODUCTION
Unmanned Aerial Vehicles (UAVs) are nowadays used in many industries, such as the food industry [1], retail [2], and healthcare [3]. In addition, they can be used in cinematography to follow stunt doubles during outdoor filming [4], and can even help with agriculture by performing repetitive tasks like seeding, planting, or spraying [5]. Moreover, when combined with Artificial Intelligence, UAVs gain further abilities, such as 3D modeling of an area in the aftermath of a disaster for analysis, and even performing the tasks mentioned above autonomously [6]. Despite all of these features, UAVs are fragile and can be damaged or suffer from malfunctions, especially when they are autonomous [7]. While executing such tasks, UAVs can be struck by insects or birds, which can damage the UAV's blades or rotors. This can ruin the UAV's stabilization and render it incapable of flying straight. The problem may go unnoticed: the drone acts as if it is properly following the predefined path while in reality it is diverging, causing it, for example, to miss the crops. As such, these kinds of anomalies should be detected in real time by the UAV and reported immediately in order to prevent long-term problems [8]. There are cases where anomalies can be detected in software by monitoring the sensors and the moving parts' actions and observations against certain thresholds, but the real world is much more complex and external influences are much harder to detect. In addition, a UAV's blades can be damaged [9] by an unexpected object hitting the UAV, or by wind that pushes it into a sturdy object. So the question is, how can we detect these anomalies in real time? Figure 1 shows the overall suggested solution: a Bluetooth earphone is used as a ''stethoscope'' sensor for the drone.
FIGURE 1. A standard BLE (Bluetooth Low Energy) earphone is used as a ''stethoscope'' for sensing the drone's ''well-being''. The middle (main) image shows a Tello micro drone (sub 100 grams) with an earphone located a few centimeters above it (see also the lower left image). A minor anomaly in the propeller ''tip'' is shown in the upper left image. Another (more visible) anomaly is demonstrated by the blue propeller (which has slightly different parameters than the black ones). On the right side, two examples of the experiments in outdoor scenarios are presented: the upper image shows a relatively high-altitude flight (about 8 meters above the ground, to avoid ''ground effects'') and the lower image shows the drone flying at a low altitude (about 1 meter above the ground).
The associate editor coordinating the review of this manuscript and approving it for publication was Guillermo Valencia-Palomo.
Most UAVs today 'know' how to stabilize themselves, but the situations mentioned above disrupt this mechanism. When a UAV's blades are modified, one thing is easily noticed: the noise that the UAV makes differs from the noise of its normal state. When a UAV gets hit, it immediately tries to re-stabilize itself, and this re-stabilization causes the rotors to spin unevenly, some faster than the others, to get the UAV back to its normal state. This process produces a sound that is different from the usual sound emitted by the UAV in its normal state. Moreover, a damaged [10] blade also emits a different sound than a blade of a UAV in a normal state [11]. The method proposed in this work for detecting anomalies in the sound emitted by the UAV uses a lightweight microphone mounted on the UAV, which can be connected to an external computer through Bluetooth or any other wireless connection. The audio stream is then passed to the external computer, on which the proposed algorithm is executed. The algorithm uses the power of Deep Learning (DL) [12], [13], [14] to classify sound clips as anomalous or regular.
Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans, which is learning by example. This technique can be used for many different tasks including prediction of future events [15], classification of data to groups [16], generation of new data [17], Anomaly Detection (AD) [18], [19] and more. A Convolutional Neural Network (CNN) is a sub-class of DL architectures, most commonly applied to analyze visual imagery [20]. CNNs utilize the convolution operation, using kernels or filters that slide along input features and provide feature maps. CNNs can also be used for audio analysis [21], by turning audio into a 2D representation called a 'spectrogram'. Similar to an image, the data can be processed by a CNN, as a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.
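To make the spectrogram representation described above concrete, the following is a minimal NumPy sketch of a short-time Fourier magnitude spectrogram. It is an illustration only and not the preprocessing used in this paper: the 16 kHz sample rate, the 400-sample frame, the 160-sample hop, the Hann window, and the function name `spectrogram` are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Return a (frequency x time) magnitude spectrogram of a 1-D signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames: (n_frames, frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 16_000                       # assumed sample rate (Hz)
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)        # synthetic 1 kHz test tone

spec = spectrogram(tone)                   # 2-D, image-like array
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 400     # bin index -> frequency in Hz
```

The resulting array has frequency along one axis and time along the other, which is what allows a CNN to treat one second of audio like a single-channel image.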
The use of DL for AD is not new [22], since DL is useful when working with data that has patterns. DL models are able to learn patterns from the data given to them as examples, which makes DL well suited to AD. An anomaly can be defined as the occurrence of an unusual event or a deviation from a common rule. Using DL, it is possible to learn to classify the activity of a system over time as anomalous or regular. AD in UAVs [8] is no different; the goal is to identify unusual activities or patterns by monitoring a stream of sensor-based information coming from the UAV. A DL model can identify whether the current input is anomalous by learning from examples [23]. The use of DL for AD in UAVs has many advantages [24], one of which is the ease of use and implementation. The first methods proposed for AD in UAVs were based on models from control theory [25], which require a good understanding of linear and non-linear systems and of the mathematics needed to implement them, while DL only needs sufficient data. Moreover, there are today many frameworks and libraries that make it quick and easy to implement and train DL models.
In this paper we use a Transformer-based model [26], [27]. A Transformer is a DL model that adopts the mechanism of self-attention [28], differently weighting the significance of each part of the input data. Transformers are also used for AD in many different fields, such as aerial videos from UAVs [29], system logs [30], and brain-scan images [31]. The self-attention mechanism of Transformers is very useful for AD, making it easier for DL models to recognize irregular activities in the input. The Transformer used in this paper takes an array that represents the raw audio of a second-long recording and outputs a matrix on which the self-attention mechanism has been applied; the CNN architecture then processes this matrix and outputs the probability that an anomaly has occurred. Specifically, we use two models from the Wav2Vec2 [26] group of transformers, Wav2Vec2-Base and Wav2Vec2-ASR-960h, and compare their performance on the task of AD in UAVs. Wav2Vec2-Base is a transformer-based model created for speech recognition that has been pre-trained on 960 hours of unlabeled raw speech data. Wav2Vec2-ASR-960h is another transformer-based model, which has been trained and fine-tuned to identify English letters in raw sound. We denote our framework by Wav2BC+, which stands for the combination of the Wav2Vec2-Base and the VGG-based CNN. Wav2Vec2 has been shown to be highly effective with relatively little training data [32], [33]. This model takes raw audio as input and outputs an image-like representation of the input data, which can then be passed to a classifier that differentiates between anomalous and regular sounds.
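As a rough illustration of how self-attention weights each part of the input differently, the sketch below implements single-head scaled dot-product attention in NumPy. It is not the Wav2Vec2 implementation: the sequence length, the feature size, and the random projection matrices `w_q`, `w_k`, and `w_v` are placeholders chosen for the example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every time step with every other time step: (T, T)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 6, 8                                # toy sequence length and feature size
x = rng.normal(size=(T, d))                # stand-in for audio frame features
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to one and states how strongly one time step attends to every other, which is the sense in which parts of the input are weighted by significance.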
Finally, the contributions of this paper are highlighted in Section I-A.
A. OUR CONTRIBUTION
1) Development of a compressed version of the well-known VGG-16 architecture that has far fewer parameters in the neural network, yet is capable of yielding high accuracy in anomaly detection in UAVs.
2) Our real-time approach achieves state-of-the-art performance compared to two baseline approaches, and high accuracy at detecting anomalies in non-ideal environments, which results in a working implementation of a system for real-time anomaly detection in UAVs. Namely, we outperform the traditional VGG-based CNN architecture by introducing a Transformer for feature extraction and acoustic embedding.
3) A quick training process over only 3 hours of recorded data, with fast convergence due to the use of Transfer Learning from an acoustic pre-trained model. That is, we tackle a problem in acoustics using technologies developed for speech recognition.
4) Exploitation of an earphone and minimal hardware, so that the method can be used on a UAV of any size without affecting its functionality.
The remainder of this paper is structured as follows: Section II surveys related work on anomaly detection in UAVs, as well as DL approaches; Section III describes the data collection procedure as well as the organization and preparation of the data for training the system; Section IV describes the proposed method in detail; Section V presents the results of the different approaches compared in this work; finally, Section VI summarizes this paper. For ease of reading, Table 1 provides a list of abbreviations that are commonly used in this paper.

II. RELATED WORK
The topic of AD in UAVs has been covered by quite a few works [34], [35], [36]. One of the first categories of approaches to this diagnostic problem utilizes model-based fault diagnosis with sophisticated [37] methods that evaluate model residuals and draw conclusions about the fault's occurrence. In [25], the method is based on a nonlinear observer [38], which is an extension of linear observer [39] design techniques using transformations related to linear observability matrices. The work in [40] later extends this method with an adaptive observer, making it more dynamic, and in [41] the resulting method is implemented for real-time application on multirotor UAVs. However, the studies above only address the consequence of the rotor's impairment, since the analyzed type of anomaly is a simulated Loss of Effectiveness [25] in thrust generation. Some other works follow the same approach with various model-based fault estimation algorithms and subsequent control strategies, such as sliding mode control [42], Model Predictive Control [43], and Kalman filters [44].
The challenge of AD is often best solved using data-driven fault detection methods [29], [45], [46], [47]. They are based on statistical modeling and classification algorithms that take sensor-based data as input and output the probability that an anomaly has occurred. In [48], the proposed method combined a hybrid Recurrent Neural Network (RNN) [49] with Long Short Term Memory (LSTM) [48] and a CNN [20] architecture, and reached approximately 92% accuracy in detecting actuator faults. The input to the hybrid network was the state information from the UAV, such as pitch, roll, pitch rate, roll rate, and yaw rate, together with the input commands sent to the motors. However, the dataset used in that work was recorded in an ideal lab environment on a predefined setup platform. Most of these works focus on two types of sensor-based information: vibrations and sound. Although the work in [48] copes with faults in UAVs, it falls outside the scope of this work for one main reason, which is the usage of LSTM: [48] presents an LSTM-based architecture on top of a CNN, which means that the required amount of memory is far too large for tiny UAVs and for real-time applications, which are the main interest of this work. As a result, we train a small architecture that receives acoustic segments of 1 second as input; any greater segment length might result in a failure in the case of tiny UAVs. Moreover, as the acoustic signals get longer (up to an anomaly), LSTMs struggle to converge to a detection. The reason is that the probability of preserving the context of an acoustic segment that is far from the segment currently being processed diminishes exponentially with the distance from it. That is, whenever 'normal' acoustic segments are long, the model might forget the content of positions that are distant in the acoustic signal.
Another problem with LSTMs is that the processing of acoustic segments cannot be parallelized, since the segments must be processed sequentially (frame by frame). In conclusion, LSTM-based methods suffer from the following problems:
• Sequential computation restricts parallel data processing.
33338 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
• LSTMs have no strict modeling ability of short- and long-range dependencies.
• The distance between the frames' positions in an acoustic signal is linear.
The existence of AI in low-memory devices such as UAVs and drones is not new. For instance, [50] introduced a survey of methodologies that combine deep learning and data science algorithms (e.g., statistics, linear regression, Bayesian methods). Another review, introduced in [51], discusses new and emerging forms of data and technologies, which seem to be a new field for future developments in AI, while [52] presented a method for the conceptualization of a healthcare system supported by autonomous AI devices (such as drones or UAVs) that can use edge health devices with real-time data. As the AI field progresses toward more complex architectures, in this paper we use a Transformer-based architecture (explained thoroughly later in this section) in order to avoid the three main LSTM drawbacks.
A handful of works [53], [54] show how the use of vibrations as the main source of information helped achieve highly accurate detections, by extracting the vibration data using sensors and then using this data as input to a model that classifies it as anomalous or normal. One work that used this approach [55] achieved more than 94% accuracy in detecting faults in UAV components, by using vibration data as input to a fuzzy ART neural network model [56] that outputs the probability that an anomaly has occurred.
A few works describe fault detection in UAVs by analyzing sound [57], [58]. The work in [59] used sound as the source of input data to a Feed Forward Neural Network model that outputs the probability that one of the blades is partially or fully broken. This method achieved 98% accuracy in detecting broken or partially broken blades. Those results show great potential in using the sound emitted from the UAV as the source of data for AD with neural networks. However, the experiments were performed with a stationary, ground-fixed UAV and an external high-class microphone, which is an ideal setting. Moreover, their training methods were based on the assumption that an imbalance in the blade is equivalent to a partial loss of the blade. In another work [60], a similar neural-based algorithm with physically impaired rotors and data collected in a real flight scenario resulted in 92% accuracy at detecting broken blades, although it can only detect whether a blade is broken or not. By contrast, the work in [57] takes into account a wider range of fault classes, including broken rotors, bearing failure, and eccentric shaft faults. Their algorithm is based on classical machine learning methods such as k-Nearest Neighbors (KNN) [61] and Support Vector Machine [62], but the dataset was recorded in a noise-free lab, with a mobile phone positioned about 1 meter from the UAV in an indoor environment, to make the recordings as clear as possible.
Our proposed method is meant to work in non-ideal environments, and the dataset that was created for our method was recorded in a variety of environments with noise. The work in [58] exploited two kinds of DL models, RNN and CNN, to see which of them achieves better accuracy in AD. The task was to identify which rotor is malfunctioning and to classify the malfunction under two fault classes: fractured tip and edge distortion. The method proposed in that work reached an accuracy of up to 98% by using an array of microphones, which makes it possible to identify which rotor is malfunctioning; however, the use of a microphone array makes the system complicated and might not be applicable to smaller UAVs. Moreover, the dataset used to train the model was recorded in an indoor environment without noise. In contrast, in this paper we propose an improvement by using a Transformer to weigh parts of the input differently in order to accentuate important features, together with a system that uses a single microphone and consequently weighs significantly less.
Acoustic data as captured by a microphone has a single dimension by nature (a stream of samples in time). Yet, it can be presented as a two-dimensional matrix (with frequency on the Y-axis and samples-in-time on the X-axis). Thus, such a matrix can be presented as an image. The approach proposed in this paper is inspired by a CNN model called VGG [63], which has already been used in the domain of acoustic analytics [64], [65]. The VGG is a CNN model used in the domain of computer vision for tasks like image classification [66] and object detection [67]. The VGG architecture achieved an 8.0% top-5 error [63] on the test set in the task of image classification with the ILSVRC dataset, which consists of 1.3M images for training and 100K for testing and has 1000 different classes. VGG scored the second highest among the tested models, right behind GoogLeNet [68] with a margin of 0.1%, but it has fewer layers and is much less complex.
Recall that in this paper we apply, before the CNN, a Transformer architecture from the domain of speech recognition called Wav2Vec2 [26], which is used in many speech recognition tasks [69] and has proved to be efficient whenever the dataset is small [32]. The Wav2Vec2 model takes raw sound data as input and outputs a more informative representation of that data [26]. Although the Wav2Vec2 is meant for tasks involving speech recognition, in this work we use the same Transformer to solve an acoustic problem. Our intuition is that an anomaly in acoustics can be considered an anomaly in patterns of speech, like speech impediments, as can be seen in [70], which shows how the Wav2Vec2 can be used for the task of identifying speech sound disorders. The Wav2Vec2 [26] is a group of transformers that are mainly used in the context of speech recognition. The Wav2Vec2 family uses representation learning to transform sound into a matrix representing that sound, while also accentuating important features in the sound. Representation learning [71] is a set of techniques that enables a system to automatically learn the representation of raw data for tasks like classification and detection [72]. This set of techniques replaces manual feature extraction and feature engineering. Transformers are a type of representation learning that uses Self Supervised Learning (SSL) [73] to learn the best representation of raw data for a given task. SSL is a technique in machine learning that is used to train a model with unlabeled data, usually before training it again later with labeled data for fine-tuning [26], [74], [75]. There are several approaches for solving the problem at hand, though some of them are only capable of detecting very specific anomalies. Other methods use machine learning to classify a wide range of classes, yet such methods usually require a massive dataset for the learning (training) process.
The proposed method in our work uses a more compact and realistic dataset and uses a Transformer to handle the noise in challenging environments.

III. DATA COLLECTION
The following section describes the data collection phase, namely: (i) the manner in which the dataset was generated; (ii) the features that were taken into account and their influence; and (iii) the differences between particular labels and their distribution.
The dataset¹ used in this paper was recorded manually by the authors. The micro UAV used for recording is DJI's Tello drone. This quadrotor is a cost-effective micro-drone and is popular among beginners, intermediates, and even professional drone developers. In addition, this quadrotor (see Figure 1) is very easy to manipulate, since there is an official SDK for Android and iOS smartphones for controlling it. A user-friendly Python interface to the Tello SDK is also available. This allows owners to connect to the quadrotor and send commands to it through WiFi, and to run self-made scripts that control it from a computer. Moreover, the Tello drone is particularly small and has multiple flight modes that make it very agile while flying. These advantages could not be ignored, and as a result, the Tello was chosen for the data generation task. The data generation process was not simple, due to two major disadvantages: (i) the Tello has a short flight time of about 5-10 minutes; (ii) the Tello's WiFi communication uses the UDP protocol. Since UDP does not guarantee delivery, the quadrotor might miss some commands, and whenever this occurs, it may land or even crash due to the loss of communication. Undesired crashes and landings forced us to scrap the current recording and start over.
For the recording procedure, we deliberately chose a setup that tests the effect of the microphone's weight on the balance of the quadrotor. The recording setup was as follows: a small piece of tin (about 5 cm long) was taped with duct tape on top of the quadrotor, at its center of mass, pointing upwards. At the tip of the tin piece, we placed a small JBL Tune 225 TWS Bluetooth earphone acting as a microphone, and the quadrotor was controlled by a computer that also received the audio stream from the earphone over Bluetooth.
The recording procedure took place in different environments: closed rooms and open spaces, both with and without background noise [76] (mostly human speech), since these conditions may affect the sound waves that the microphone is recording. These conditions were taken into account so that the dataset would be as diverse as possible, and to get better performance in non-ideal environments. The recordings were done both manually and automatically, using two different scripts to control the quadrotor. The scripts also produced a log file in which information about the quadrotor was written every tenth of a second. The scripts logged information such as recording time, flying status (whether an anomaly has occurred or not), barometric sensor data, yaw, pitch, and roll angles, height, and battery percentage. Next, the script saved the recorded audio in the WAV (Waveform Audio) file format and the log file corresponding to that recording in the CSV (Comma-separated values) file format. For the automatic recordings, the script included numerous pre-defined movement patterns for variety, such as square orbits and turns at varying altitudes.
¹ Available upon request from the authors, as well as the code.
Different types of anomalies were recorded, including partially broken and defective blades, undesired movements or destabilization, and hits from an external source. To create even more diversity, all the recordings were done with the Tello's original blades as well as with third-party blades (slightly lighter than the original ones). Furthermore, actually broken blades were used in the recordings as well. Each recording is 2 minutes long. In order to understand how and when the small UAV experienced an anomaly while recording, we combined the movement commands sent to the quadrotor with the data received from its sensors. For instance, an anomaly is registered whenever the quadrotor was commanded to stay still but an unwanted movement occurred. Another instance of an anomaly is whenever a hit is recognized. A hit can be characterized as a drastic unwanted change in the readings of the quadrotor's accelerometer; hence, the same approach can be adapted to identify hits from external sources.
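The two labeling rules described above (an unwanted movement while commanded to stay still, and a drastic change in the accelerometer reading) can be sketched as follows. This is a hedged illustration rather than the authors' script: the function name `flag_anomalies`, the log layout, and the value of `HIT_THRESHOLD` are hypothetical.

```python
import numpy as np

HIT_THRESHOLD = 0.5  # hypothetical jump between consecutive accelerometer readings

def flag_anomalies(accel, commanded_still, moved, hit_threshold=HIT_THRESHOLD):
    """Return one boolean anomaly flag per log row (rows are 0.1 s apart)."""
    accel = np.asarray(accel, dtype=float)
    # Absolute change between consecutive readings; the first row has no predecessor
    jump = np.abs(np.diff(accel, prepend=accel[0]))
    hit = jump > hit_threshold
    # Movement observed although the last command was to stay still
    unwanted = np.asarray(commanded_still, dtype=bool) & np.asarray(moved, dtype=bool)
    return hit | unwanted

accel = [0.0, 0.02, 0.01, 0.9, 0.05, 0.03]            # toy accelerometer magnitudes
commanded_still = [True, True, True, True, True, True]
moved = [False, True, False, False, False, False]
flags = flag_anomalies(accel, commanded_still, moved)
```

In this toy log, row 1 is flagged as an unwanted movement, and rows 3 and 4 are flagged because both the spike and the return to rest exceed the assumed threshold.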
The audio recordings amount to 3 hours in total, separated into recordings of two minutes each (as mentioned before), each with a corresponding log file describing it. After the recording phase, data engineering was needed in order to use the recordings for training the DL model. The audio recordings were split into 1-second long soundtracks and saved into a directory, each with a unique name. Using the log files, a corresponding label was also written for each 1-second recording and saved as a text file in a different directory, under the same name as the recording. The result was two directories: one with WAV files of 1-second long sound bits, and the other with the labels for each second-long sample.
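The splitting and labeling step described above can be sketched in a few lines. The sketch makes assumptions that the text does not state: a 16 kHz sample rate, a log with one row every 0.1 s, and the rule that a 1-second clip is labeled anomalous if any log row inside that second reports an anomaly. The function name `split_and_label` is illustrative.

```python
import numpy as np

def split_and_label(waveform, sample_rate, log_times, log_anomaly):
    """Split audio into 1-second clips and derive one label per clip from the log."""
    n_clips = len(waveform) // sample_rate
    # Drop the incomplete tail and reshape to (n_clips, samples_per_second)
    clips = waveform[: n_clips * sample_rate].reshape(n_clips, sample_rate)
    labels = np.zeros(n_clips, dtype=int)
    for t, is_anomaly in zip(log_times, log_anomaly):
        clip_index = int(t)  # the whole second in which the log row falls
        if is_anomaly and clip_index < n_clips:
            labels[clip_index] = 1  # assumed rule: any anomalous row marks the clip
    return clips, labels

sample_rate = 16_000                      # assumed sample rate (Hz)
waveform = np.zeros(3 * sample_rate)      # three seconds of placeholder audio
log_times = np.arange(30) / 10            # one log row every tenth of a second
log_anomaly = np.zeros(30, dtype=bool)
log_anomaly[12:15] = True                 # anomaly logged between 1.2 s and 1.4 s

clips, labels = split_and_label(waveform, sample_rate, log_times, log_anomaly)
```

Under this scheme a two-minute recording yields 120 clips, so a total of 11,040 samples would correspond to 92 such recordings, slightly more than three hours of audio.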
The total number of samples is 11,040. In order to verify that the collected data is as diverse and non-trivial as we wished, we manually inspected different types of Normal records and Anomaly records by visualizing the data. A natural challenge we faced while exploring the dataset was the class imbalance: a quick statistical analysis showed that 85% of the samples were labeled as Normal, whereas only 15% were labeled as Anomaly. This is not a surprise, since during most of the flight time the UAV does not experience any anomalies. In Section V, we discuss two methods by which the data is processed into visual images: spectrograms and emissions. Figures 2 and 3 show a particular sample from an audio recording, labeled as ''normal'', visualized as an emission from the Wav2Vec2 [26] Transformer (Figure 2) and as a spectrogram (Figure 3). We found that flights with partially broken and defective blades are much harder to classify as anomalies, and are visually very similar to normal recordings. Hence, we speculated that our CNN (discussed in Section IV) would have a hard time catching these anomalies. Figures 4 and 6 show two different samples, labeled as ''anomaly'', visualized as emissions from the Wav2Vec2; the same samples are visualized as spectrograms in Figures 5 and 7.
The first sample visualizes part of the quadrotor's stabilization process, and it is easy to distinguish from the normal sample. On the other hand, the second sample is much harder to classify as Anomaly. This sample visualizes one second from a flight in which the quadrotor had a defective or partially broken blade. The sound emitted from the rotors was almost identical to the sound of proper rotors; this case is discussed further in Sections IV and V. Finally, Figure 8 shows the data collection process described in this section. Namely, the Main-Computer component runs the process that sends the commands to the UAV, and the UAV creates sound waves during its flight time. Next, two threads work in parallel: (i) transmission of information from the microphone to the computer, and (ii) transmission of the yaw, pitch, and roll states from the UAV's accelerometer to the computer. Finally, this process outputs whether the UAV is in an Anomaly or a Normal state.
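The class imbalance reported in this section (85% Normal, 15% Anomaly over 11,040 samples) is one reason why an F1-score is more informative here than plain accuracy. The short calculation below illustrates this with a trivial classifier that always predicts Normal. The per-class counts are derived from the rounded percentages and are therefore approximate, and the helper `f1_score` is written out for this example only.

```python
total = 11_040
n_anomaly = round(0.15 * total)   # approximate count, from the rounded 15%
n_normal = total - n_anomaly

def f1_score(tp, fp, fn):
    """F1 for the positive (Anomaly) class; defined as 0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# A trivial classifier that labels every clip as Normal:
tp, fp, fn = 0, 0, n_anomaly      # it misses every anomalous clip
accuracy = n_normal / total       # still high, because Normal dominates
f1 = f1_score(tp, fp, fn)         # no anomaly is ever detected
```

Such a classifier reaches about 85% accuracy while detecting no anomalies at all, so its F1-score for the Anomaly class is zero.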

IV. OUR APPROACH
The following section describes our proposed method for detecting two types of anomalies in UAVs, (i) unplanned stabilization and (ii) a malfunctioning or tackled propeller, using the sound that the UAV emits.
The remainder of this section is structured as follows: Section IV-A describes in detail the idea of a Transformer and discusses the specific Transformer-based architecture used in this work. Next, Section IV-B introduces the VGG architecture and discusses its use both as a separate model and as part of the proposed hybrid model in this work (Figure 9). Section IV-C presents the Transfer Learning technique and its implementation using the Wav2Vec2 over the CNN model. Finally, Section IV-D describes the final algorithm.
A. Wav2BC+ AND Wav2Vec2-ASR-960h
Wav2Vec2 [26] is a group of transformers that are mainly used in the context of speech recognition. The Wav2Vec2 family uses representation learning to transform sound into a matrix representing that sound, while also accentuating important features in the sound. Representation learning [71] is a set of techniques that enables a system to automatically learn the representation of raw data for tasks like classification and detection [72]. This set of techniques replaces manual feature extraction and feature engineering. Transformers are a type of representation learning that uses SSL [73] to learn the best representation of raw data for a given task. SSL is a technique in machine learning that is used to train a model with unlabeled data, usually before training it again later with labeled data for fine-tuning [26], [74], [75]. Note that in the scope of this paper, the self-supervised learning has already taken place in the pre-trained Wav2Vec2 model, from which we perform the transfer learning. That is, the Wav2Vec2 model was only fine-tuned on our dataset and was not trained or re-trained using SSL.
As aforementioned, the Wav2Vec2 takes raw audio as input and outputs an image-like representation of the input data, which can then be passed to a classifier that differentiates between anomalous and regular sounds. Namely, both Wav2Vec2 models are composed of a multi-layer convolutional feature encoder that receives a raw audio matrix as input and outputs latent speech representations for each of T time steps. Next, the speech representations are fed into a Transformer that creates T representations, extracting information from the whole sequence. Finally, the feature encoder output is discretized in order to represent the targets (outputs) in a self-supervised objective function. The feature encoder contains a temporal convolution followed by a normalization layer and a GELU [77] activation function. The encoder's total stride determines the number of time steps T that serve as the Transformer's input. In this manner, it is possible to distinguish the sound emitted from damaged rotors even when it is almost identical to the sound of proper rotors. Next, the Transformer produces contextualized speech representations; that is, the feature encoder output is fed into a context network that follows the Transformer architecture as in [78]. The main change is that instead of fixed positional embeddings [78], which encode absolute positional information, the Wav2Vec2 uses a convolutional layer that acts as a relative positional embedding. The convolution's output, followed by a GELU [77] activation function, is added to the inputs, and layer normalization is then applied. Figure 10 shows the Wav2Vec2 architecture.
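To illustrate how the encoder's total stride fixes the number of time steps T, the sketch below walks a one-second input through a stack of strided temporal convolutions. The kernel widths and strides are those of the publicly documented Wav2Vec2 base feature encoder, and the 16 kHz sample rate is likewise an assumption; neither is stated in this paper, so the numbers should be read as an example rather than as the configuration used here.

```python
# (kernel width, stride) per convolutional layer; assumed from the public
# Wav2Vec2 base configuration, not taken from this paper.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_time_steps(n_samples, layers=CONV_LAYERS):
    """Output length of a stack of unpadded 1-D convolutions."""
    length = n_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length

total_stride = 1
for _, stride in CONV_LAYERS:
    total_stride *= stride                # product of all layer strides

SAMPLE_RATE = 16_000                      # assumed sample rate (Hz)
T = num_time_steps(SAMPLE_RATE)           # time steps for a 1-second segment
frame_ms = 1000 * total_stride / SAMPLE_RATE
```

Under these assumptions the total stride is 320 samples (20 ms), so a one-second segment is mapped to T = 49 latent frames, and the Transformer attends over those 49 positions.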
Wav2Vec2 was originally designed for human-speech recognition. Yet, it can be exploited for general acoustic problems, such as those arising in UAVs. Wav2Vec2 leverages self-supervised training in a continuous framework from raw audio data. It builds context representations over continuous speech representations, and its self-attention captures dependencies over the entire sequence of latent representations end-to-end [79]. Speech representations can be used for several downstream tasks [80], such as AD in UAVs using sound. Similar to human speech, UAVs produce continuous raw audio: the normal state can be regarded as the counterpart of silence in human speech, while anomalies are reflected as notable shifts in the acoustic signal (as in human speech), which can be recognized clearly. Therefore, applying Wav2Vec2 to AD in UAV sound can improve the model's ability to detect patterns in the data, which eventually increases the model's accuracy.

B. CNN (VGG-16)
In this research, we used an altered version of a popular CNN architecture, the VGG-16 [63]. The VGG-16 model was designed for image classification and object localization and won first place in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition in 2014. Although the VGG-16 model is mainly used for image processing tasks, it can be used for speech processing and phoneme recognition by converting a sound segment into a spectrogram or another visual form that can be represented as an image. As a result, the model can classify segments or extract features from them. In order to train a VGG-based model, the input consists of a fixed-size 224 × 224 RGB image, where the only preprocessing is the subtraction of the mean RGB value, computed over the training set, from each pixel. Next, the image is passed through a stack of convolutional layers with very small filters of size 3 × 3. The stride of the convolutions is set to 1 pixel, and the spatial padding of each convolutional layer's input is chosen so that the spatial resolution is preserved after the convolution operation. Spatial pooling is performed by five max-pooling layers, which follow some of the convolutional layers. The max-pooling is computed over a window of 2 × 2 pixels with a stride of 2. The stack of convolutional layers is followed by 3 fully-connected layers, where the third performs classification over the 1000 classes of the ILSVRC [81] competition. Finally, a softmax layer outputs the probability of each class. Throughout the neural network, all of the hidden layers use the ReLU activation function [82].
We used the VGG-16 model in two methods: (i) an altered version of the VGG-16 CNN architecture as a standalone model; (ii) as the classifier in a two-layer model. In the second method, we use the Wav2Vec2 to extract features from sound segments and visualize them as images that are used as input to the CNN. In the next subsection, we go into detail about the second method that uses the Transfer Learning technique.

C. Wav2Vec2 OVER CNN
Our main approach is based on the combination of Wav2Vec2 and the VGG-based CNN, to which Transfer Learning [83], [84] is applied. In DL, Transfer Learning is the reuse of knowledge gathered by a model that was trained for a specific task as the backbone for a different, often more advanced, task. This approach is popular in DL, where pre-trained models are used as the starting point for Computer Vision and Natural Language Processing tasks, given the vast computing and time resources required to develop neural network models for these problems [85]. Using Transfer Learning with pre-trained Transformers [86] can improve the chances of solving AD in UAVs from acoustic signals with only 3 hours of recorded data. Wav2Vec2 is a model pre-trained on a training set of over 960 hours of data, so it can perform well on tasks that are similar in nature to the task tackled in this paper. Applying Transfer Learning with a pre-trained Transformer on top of a CNN model (VGG) contributes to faster convergence of the model while improving its accuracy [87]. In this approach, the input of the model is the Wav2Vec2 input presented in Section IV-A, i.e. the raw audio of the sound emitted from the UAV, and the Wav2Vec2 output is an image-like representation that serves as input to the VGG-based architecture employed in this paper (Figure 12). As for the output, our CNN model, which is based on the VGG-16 architecture (Section IV-B), outputs only 2 classes instead of the 1000 mentioned in Section IV-B, since the challenge in this work is to distinguish between anomalous and normal states in UAV flight.

D. PROPOSED METHOD
In this paper we propose a method for AD in UAVs, using only the sound emitted from them. The method is based on a transformer-based model for binary classification, so that the model gets as input the raw sound that is emitted from the UAV, and outputs the probability that an anomaly has occurred. In this paper the model is built from two components, a transformer-based architecture for feature extraction called Wav2Vec2 [26] and a classifier model that is inspired by the VGG CNN architecture [63]. Figure 11 shows the architecture of VGG-16, while Figure 12 presents ours, which is the mini-VGG version of VGG-16.
Observing Figures 11-12, one can note the main differences between the VGG-16 architecture and the CNN of the Wav2BC+, which are summarized in Table 2. Our modified CNN architecture of the Wav2BC+ reduces the number of parameters of the well-known VGG from 138M to a shrunk version with only 3.6M parameters.
Our approach demonstrates that tools from the domain of speech recognition and analysis can be used in audio-analysis problems. It starts by using Wav2Vec2 to extract features from the raw audio data and obtain a representation of the audio. The new representation is then passed to the VGG-based CNN model, which serves as the classifier that determines whether the sound is anomalous or regular. In the training process, we used 1-second-long sound samples. To make training faster and create a more reliable model, we used transfer learning to fine-tune the Transformer and train the classifier model. Figure 9 illustrates the pipeline proposed in this paper.

V. EXPERIMENTAL EVALUATION
The following section is dedicated to the validation of our hypothesis that using Transformer-based techniques for UAVs AD using sound can outperform classical CNN techniques. Our evaluation was conducted on an HP Omen computer, Windows 11 64Bit OS with 3.20GHz AMD Ryzen 7-5800H CPU, 32GB of RAM, and NVIDIA GeForce RTX 3070 GPU, using PyTorch (v1.14.0) and scikit-learn (v1.1.3).

A. DATASETS FOR EXPERIMENTAL EVALUATION
This section presents 3 different datasets, each of them with its purpose and uniqueness for an appropriate experiment. Table 3 summarizes the datasets used for each of the experiments.

B. ANOMALY DETECTION USING CLASS-WEIGHTS
Recall that in our dataset, 85% of the samples were labeled as Normal, while only 15% of the samples were labeled as Anomaly (Section III), which leads to an imbalanced dataset. The problem that arises from an imbalanced dataset, i.e. the skewed ratio between examples of the Anomaly and Normal classes, should be addressed before any further progress. Suppose that our classifier always produced ''normal-state'' as the answer for all the test examples, i.e. always predicted Normal. Even though it would obtain ∼85% Accuracy over our dataset (as ∼85% of the dataset consists of Normal examples), it would still perform poorly [88] in terms of the Precision and Recall measures, which indicate how successful the model is (accuracy, precision, and recall are discussed thoroughly later in this section). As a result, we used a class-weighted cross-entropy loss function in which each class's weight is the inverse of the number of examples of that class in the dataset, i.e. $|\mathrm{Anomaly}|^{-1}$ for class Anomaly and $|\mathrm{Normal}|^{-1}$ for class Normal. Since our aim is to detect anomalies, such an imbalance is to be expected. Therefore, we also tested the use of a non-weighted binary-cross-entropy [89] loss function on the Transformer.
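To illustrate the effect of inverse-frequency class weights, the following minimal sketch evaluates an ''always Normal'' classifier with and without weighting. The class counts (85 and 15) are hypothetical and merely mimic the 85%/15% ratio; normalizing by the sum of the weights is one common convention and is an assumption here, not necessarily the exact implementation used in our experiments.

```python
import math

# Hypothetical class counts that mimic the 85% / 15% ratio.
counts = {"Normal": 85, "Anomaly": 15}

# Inverse-frequency weights: w_c = 1 / |c|.
weights = {label: 1.0 / n for label, n in counts.items()}


def cross_entropy(probs, labels, class_weights=None):
    """Mean of -w * log p(true class), normalized by the sum of weights."""
    total, norm = 0.0, 0.0
    for p, label in zip(probs, labels):
        w = 1.0 if class_weights is None else class_weights[label]
        total += -w * math.log(p[label])
        norm += w
    return total / norm


# A classifier that is 99% sure every sample is Normal.
labels = ["Normal"] * counts["Normal"] + ["Anomaly"] * counts["Anomaly"]
probs = [{"Normal": 0.99, "Anomaly": 0.01} for _ in labels]

plain = cross_entropy(probs, labels)
weighted = cross_entropy(probs, labels, weights)
print(round(plain, 3), round(weighted, 3))
```

With the weights, both classes contribute equally to the loss, so the trivial classifier is penalized much more heavily than under the non-weighted loss.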

C. CROSS-ENTROPY & BINARY-CROSS-ENTROPY
Next, we present the loss functions considered for the models' construction in this work. We used both the Binary Cross Entropy (BCE) and the regular Cross Entropy (CE), as follows.

The BCE compares a target y with a prediction p on a logarithmic scale. In neural network implementations, the value of y is either 0 or 1, while p can take any value between 0 and 1. The formula of the BCE loss is presented in Eq. (1):

$$\mathrm{BCE}(y, p) = -\big(y \log(p) + (1 - y) \log(1 - p)\big) \tag{1}$$

When visualizing the BCE loss for a target value of 1, the loss grows sharply and without bound as the prediction approaches the opposite value, 0. This means that small deviations are punished, albeit lightly, whereas big prediction errors are punished significantly. This fact makes the BCE loss a good candidate for binary classification problems, in which a classifier has two output classes. The Sigmoid activation function receives the last layer's output (logits) as input and outputs a single value between 0 and 1, which represents the probability of class 1 being the target class (while the probability of class 0 is 1 - P(class 1)). The BCE loss function expects a single input value between 0 and 1. Therefore, the Sigmoid activation function is commonly used for binary classification problems, as it ensures that the output of a neural network fits the input expectations of the BCE loss function. The formula of the Sigmoid is presented in Eq. (2):

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

The CE loss [90], on the other hand, compares a one-hot 1-dimensional target vector y with a 1-dimensional probability vector p, again on a logarithmic scale. In neural network implementations, the target vector consists of i = 1, 2, . . . , M entries such that exactly M − 1 entries are equal to 0 and the entry representing the correct class is equal to 1, while the prediction vector consists of M entries with values between 0 and 1. The CE loss is given in Eq. (3):

$$\mathrm{CE}(y, p) = -\sum_{i=1}^{M} y_i \log(p_i) \tag{3}$$

As for the output layer, we used the Softmax activation function, which receives a 1-dimensional vector (logits) and outputs a 1-dimensional probability vector that contains the probability of each class. Therefore, applying a Softmax activation function to the last layer's output ensures that the output of the model fits the CE loss function. This fact makes the CE a good candidate for multiclass classification problems, in which a classifier has more than 2 classes. Yet, it is possible to use the CE for binary classification problems, since the binary case is a special case. The equation of the Softmax activation function is given by Eq. (4):

$$\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{M} e^{x_j}} \tag{4}$$

where $x_i$ is a specific element of the 1-dimensional vector of logits that the Softmax activation function receives. Next, we discuss in Sections V-D and V-E the models constructed in this experimental evaluation.
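The remark that binary classification is a special case of the CE can be checked numerically. The sketch below is illustrative only: it uses the standard definitions of the BCE with a Sigmoid and of the CE with a Softmax, a hypothetical logit value, and the assumption that the logit of class 0 is fixed to 0.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce(y, p):
    """Binary cross entropy for a target y in {0, 1} and a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def ce(one_hot, probs):
    """Cross entropy between a one-hot target and a probability vector."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs))


z = 1.7  # hypothetical logit of class 1
for y in (0, 1):
    one_hot = [1 - y, y]
    loss_bce = bce(y, sigmoid(z))
    loss_ce = ce(one_hot, softmax([0.0, z]))  # class-0 logit fixed to 0
    print(y, round(loss_bce, 6), round(loss_ce, 6))
```

For both target values the two losses coincide, because a two-class Softmax over the logits (0, z) assigns class 1 the probability Sigmoid(z).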

D. SPECTROGRAMS AND CNN
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies over time. Spectrograms are widely used in speech processing [91] and phoneme recognition, because they make it possible to distinguish a specific frequency and its level in decibels over time. This can be very useful for recognizing the acoustic anomalies of a flying UAV. The first question is how to adjust the dataset so that the CNN can be trained and evaluated on it. Since a spectrogram can be saved as an image, it can be fed to the CNN as input. Hence, for each recorded sample, we generated a spectrogram. The spectrograms are generated by NFFT (Non-equispaced Fast Fourier Transform) [92], with a sample rate of 16,000 Hz. Each frame of audio is windowed using the Hann function [93] with a window of length 512, and the number of points of overlap between frames is 384. As such, each spectrogram is represented as an image of size 320 × 240 pixels.
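The framing parameters above determine the time-frequency grid of each spectrogram. The following minimal sketch computes that grid for a 1-second sample, assuming that the FFT length equals the window length and that the signal is not padded at its edges (both are our assumptions); the 320 × 240 image size presumably refers to the rendered image rather than to this grid.

```python
import math

SAMPLE_RATE = 16_000            # Hz, as stated in the text
WINDOW_LENGTH = 512             # samples per Hann-windowed frame
OVERLAP = 384                   # samples shared by consecutive frames
HOP = WINDOW_LENGTH - OVERLAP   # samples between frame starts


def hann_window(length):
    """Symmetric Hann window: w[i] = 0.5 - 0.5 * cos(2*pi*i / (length - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def spectrogram_grid(num_samples):
    """Return (time frames, frequency bins), assuming no edge padding."""
    frames = (num_samples - WINDOW_LENGTH) // HOP + 1
    bins = WINDOW_LENGTH // 2 + 1   # one-sided spectrum of a real signal
    return frames, bins


frames, bins = spectrogram_grid(SAMPLE_RATE)   # a 1-second sample
bin_width_hz = SAMPLE_RATE / WINDOW_LENGTH     # frequency resolution
hop_ms = 1000 * HOP / SAMPLE_RATE              # time resolution
print(frames, bins, bin_width_hz, hop_ms)      # 122 257 31.25 8.0
```

Under these assumptions, each 1-second sample is described by 122 frames of 257 frequency bins, with a resolution of 31.25 Hz in frequency and 8 ms in time.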
The ''new'' dataset contains the spectrograms and their labels. Each spectrogram was loaded into the dataset as a greyscale image, since a greyscale image is 2-dimensional whereas an RGB image is not. The dataset was then shuffled and split into 3 subsets: a training set (80% of the entire dataset, 8832 samples), a validation set (10% of the entire dataset, 1104 samples), and a test set (10% of the entire dataset, 1104 samples). It is important to note that the proportion between the output classes is kept in each subset. Next, the CNN model uses the Adam optimizer [48] with a learning rate of 0.0001 and the BCE loss function. The CNN was trained on our training set for 10 full epochs, with a mini-batch size of 16. At the end of each epoch, we evaluated the CNN's performance on our validation set, also with a mini-batch size of 16. Before starting a new epoch, we recorded the loss and accuracy for the training and validation sets. After the training session ended, we evaluated the model using the test set. Figure 13 demonstrates that the non-weighted BCE loss function outperforms the class-weighted cross-entropy loss function. We show the training process for both the CE and BCE loss functions over 8 epochs. One can note from Figure 13 that after only 1 epoch the BCE loss already converges to a minimal value, yet for comparison purposes it is presented for up to 8 epochs, like the CE loss.
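A split that keeps the class proportion in every subset is usually called a stratified split. The following is a minimal sketch of one way to do it; the label list is hypothetical and is derived only from the stated totals (11040 samples, 15% Anomaly), and the function name and the seed are our own choices rather than the code used in the experiments.

```python
import random


def stratified_split(labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Split sample indices into train/val/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        n_val = round(val_frac * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test


# Hypothetical labels: 1 = Anomaly (15%), 0 = Normal (85%).
labels = [1] * 1656 + [0] * 9384
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))      # 8832 1104 1104
```

With these hypothetical counts, the subset sizes match the 8832/1104/1104 split reported above, and each subset contains roughly 15% Anomaly samples.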

E. Wav2Vec2 OVER CNN
The Wav2Vec2 group of Transformers provides a set of pre-trained Transformers, such as Wav2Vec2-Base [32], which was pre-trained on 960 hours of unlabeled audio from the LibriSpeech dataset, and Wav2Vec2-ASR-Base-960H, which was pre-trained on the same 960-hour dataset and fine-tuned for an Automatic Speech Recognition (ASR) task. The usage of pre-trained Transformers allows us to fine-tune the Transformer with a small dataset (our dataset consists of ∼3 hours of labeled audio). When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, one also needs to build pipelines for feature extraction and post-processing in the same manner as was done during training. To build this pipeline, we used the torchaudio.pipelines module, which contains prepared pipelines for each of the Wav2Vec2 models. The idea of Transfer Learning is widely used in this section via the Transformer, which is designed to transform its input sequence into another one with the help of two parts (Encoder and Decoder [94]), and the CNN model.
In order to implement the idea of transfer learning on our models (Transformer and CNN), the pipeline includes both models. The following sections describe the fine-tuning process while emphasizing the transfer learning idea in it as well. That is, Section V-E1 describes the fine-tuning process using an already fine-tuned (to a different problem) Transformer; Next, Section V-E2 presents the idea of fine-tuning a pre-trained Transformer for AD in UAVs acoustic problems without it ever been introduced to a similar problem before.

1) FINE-TUNING Wav2BC+ WITH Wav2Vec2-ASR-960H
The following model consists of a pre-trained Transformer (Wav2Vec2-ASR-Base-960H), which has been fine-tuned to ASR problems, and a CNN. The input for the Transformer is a 1-second waveform in the form of a 1-dimensional array with a sample rate of 16000Hz. The waveform is transformed into a tensor with a shape of [1, 49, 29] (the Transformer's output), which is fed as input to the first layer of the CNN model.
Next, the fine-tuning process begins as the CNN model starts its training loop over the outputs from the Transformer. The CNN was trained for 10 full epochs, with a mini-batch size of 16. By the end of each epoch, we evaluated the CNN's performance by validating its stats using the validation set. One can note from Figure 14 that the non-weighted BCE loss function outperforms the weighted-class CE loss function.

2) FINE-TUNING OF Wav2BC+
The following model consists of a pre-trained Transformer (Wav2Vec2-Base) and a CNN. The input for the Transformer is a 1-second waveform in the form of a 1-dimensional array with a sample rate of 16000Hz. The waveform is transformed into a tensor with a shape of [1,49,768] (the Transformer's output), which is fed as input to the first layer of the CNN model. Notice that the output of Wav2Vec2-Base and the output of ASR-960H-Wav2BC+ have different shapes, which is a direct result of the latter Transformer being already fine-tuned to a specific task (ASR).
The CNN was trained for 10 full epochs, with a mini-batch size of 16. By the end of each epoch, we evaluated the CNN's performance by validating its stats using the validation set. After the training session ended, we evaluated the model using the testing set. One can note from Figure 15 that the non-weighted BCE loss function outperforms the weighted-class CE loss function.

F. RESULTS & MODELS COMPARISON
In order to measure how good a model is, there are many different metrics that can indicate the quality of a model. For classification problems with a balanced ratio of the classes present in the training dataset, accuracy is good enough and can indicate quite well how good a model is at a certain task.
Accuracy aims to answer the question of how close a given set of measurements (observations or readings) is to the true values. The formula for computing the Accuracy is presented in Eq. (5):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

where True Positive (TP) is the number of inputs that are true and that the model classified as true, True Negative (TN) is the number of inputs that are false and that the model classified as false, False Positive (FP) is the number of inputs that are false but that the model classified as true, and False Negative (FN) is the number of inputs that are true but that the model classified as false.

However, accuracy is not a good enough indicator whenever the data is imbalanced, meaning that one class occurs much more often than the others. In this paper, the dataset is highly imbalanced (see Section III), i.e. the anomalies are by definition uncommon, hence they naturally occur infrequently in our dataset. Therefore, we use a different metric, called F1-score, which is a better metric for measuring classification performance on imbalanced datasets [95]. The F1-score uses two other metrics, called Precision and Recall, to present a more precise reflection of the models' performance.
The Precision is a measure of how many of the positive predictions made are correct (TP), and its formula is presented in Eq. (6), as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

The Recall (or Sensitivity) is a measure of how many of the positive cases the classifier correctly predicted, out of all the positive cases in the data, and its formula is presented in Eq. (7), as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

Finally, the F1-score is a metric that combines both Precision and Recall. It is generally described as the Harmonic-Mean [96] of these two. A harmonic mean is a way to calculate an average of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. The idea is to provide a single metric that weights the two ratios (precision and recall) in a balanced way, requiring both to have higher values for the F1-score to rise. The formula of the F1-Score is presented in Eq. (8), as follows:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$

In order to create a fair and precise comparison between the 3 models tested in this paper, each model was measured using the following metrics: (i) Accuracy; (ii) Precision; (iii) Recall; and (iv) F1-Score. Table 4 presents each model's performance, based on each of these four metrics. At first glance, it is not very clear which of the 3 models trained as part of this study produces better results on the test set. After a deeper examination of the results, it is possible to evaluate which model yields better results for different types of tasks. The decision to train the models over 10 epochs is a result of the nature of the models: they start to diverge after 5-8 epochs, as presented in Figures 13, 14, and 15.
Since our dataset is imbalanced, accuracy is not an appropriate performance measure for the problem we study. The main reason is that the examples of the majority class (Normal) overwhelm the examples of the minority class (Anomaly), meaning that even poor and untrained models can achieve accuracy scores of 90 percent and above. Therefore, comparing the accuracy of the models is less effective than comparing the other measurement metrics mentioned in Section V-F.
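The gap between Accuracy and the other metrics on imbalanced data can be illustrated with a small calculation. The sketch below uses the standard definitions of the four metrics; both confusion matrices are hypothetical (a 100-sample test set with 85 Normal and 15 Anomaly samples) and are not results from Table 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical classifier that always predicts Normal.
trivial = classification_metrics(tp=0, tn=85, fp=0, fn=15)

# Hypothetical detector that finds 12 of the 15 anomalies with 2 false alarms.
detector = classification_metrics(tp=12, tn=83, fp=2, fn=3)

print([round(v, 3) for v in trivial])    # [0.85, 0.0, 0.0, 0.0]
print([round(v, 3) for v in detector])   # [0.95, 0.857, 0.8, 0.828]
```

The trivial classifier reaches 85% Accuracy while its Precision, Recall, and F1-Score are all zero, which is why the F1-Score is the more informative metric here.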

1) TRANSFORMER BASED MODEL VS CNN BASED MODEL
A comparison between the models in Entries (1)-(2) of Table 4, which implement the idea of Transfer Learning, and the model in Ent. (3), which implements the classic CNN (VGG) based approach, is central to this study, as it tests our thesis regarding the importance of using Transformers to detect anomalies in the sound emitted from UAVs. Considering the Precision of each model in the table, it can be clearly seen that the Transfer Learning approach of Ent. (1) and Ent. (2) in Table 4 yields better performance than the classic technique of Ent. (3). One can conclude from this that these models are almost never wrong when they detect an Anomaly.
In terms of Recall, the model from Ent. (3) yields better values than Entries (1)-(2), although Ent. (2) comes very close to it. Despite the tiny gap between the Recall of Ent. (3) and Ent. (2), it is possible to say that the Transfer Learning technique yields a model that manages to identify a fairly high number of Anomalies, which is more important to the problem presented in this work.
Next, and as can be seen from Ent.(2), a Transformer over CNN is a better model for tasks that focus on minimizing false positives, while the model from Ent.(3) is better for tasks that focus on minimizing false negatives. In imbalanced datasets, the goal is to improve Recall without hurting the Precision. Based on that, we might conclude that the Transformer over CNN model did better in the test compared to the classic model. However, neither Precision nor Recall tells the whole story, i.e. a model might have excellent Precision with terrible Recall and vice-versa. Thus, the F1-Score provides a manner to express both concerns with a single score. One can note from Table 4 in Entries (2) -(3) that the model that performed better is the Transformer over CNN, since its F1-Score is higher than the CNN Model.
These results support the thesis of this paper since it claims that Transformer based model would perform better than a CNN Model. The reason for this difference between the results is based on the structure of the models. The Transformer based Model is using self-attention layers, which helps the Model identify important features in the input and emphasize them. On the other hand, CNNs in their nature are not searching for important features in their input, but search for patterns over the entire input instead, which makes the detection process harder since there is no attention to the important details. Thus, the exploitation of Transfer Learning with the Transformer and the CNN allows the CNN to train and look for patterns over the important features and thus improve its performance.

2) Wav2Vec2-ASR-BASE-960H VS Wav2Vec2-BASE
The performance comparison of Entries (1)-(2) in Table 4 provides us with a deeper understanding of the selection of the Transformer. In addition, it sheds light on whether using a model that was already fine-tuned to a different problem (such as ASR) results in underperformance compared to a regular pre-trained model.
According to Table 4, in terms of accuracy, Ent. (2) yields a more accurate model (0.92) compared to Ent. (1), which yields 0.90. Since our dataset is imbalanced, accuracy is not a good enough metric for this case. Therefore, the other metrics, namely Precision, Recall, and F1-Score, are more informative.
As for the Precision, Ent. Considering the results of the models that correspond to Entries (1)-(2), it is possible to say that the regular pre-trained Transformer (Ent. (2)) performed better than the already fine-tuned Transformer (Ent. (1)). The main reason for this gap in results between these 2 models is the fact that fine-tuning a Transformer to a specific problem (ASR) reduces the number of output features that the Transformer feeds to the CNN. As a result, the CNN receives a small number of features in which to search for patterns, which leads the model to underperform compared to the model with the regular Transformer. Therefore, one can conclude that the Wav2Vec2-Base Transformer over a CNN (VGG) model is the better of the two.

G. RESULTS -OUT OF DISTRIBUTION EXPERIMENT
A vital criterion for deploying a powerful classifier in many real-world AI-based applications is the ability to detect test instances that are sufficiently far away from the training-set distribution. Many classification problems, such as speech recognition, visual object detection, and Anomaly Detection in general, have gained great accuracy metrics by using neural networks. However, determining the uncertainty of a specific prediction is still a difficult problem. Well-calibrated predictive uncertainty is crucial, since it can be used in a variety of AI-based applications.

TABLE 4. Results table for the comparison of the Transformer Wav2Vec2-ASR-Base-960H + CNN (VGG-based), Transformer Wav2BC+ (VGG-based), and the CNN (VGG-based) on spectrograms. Ent. (3) represents an ablation study [97] with respect to Entries (1)-(2), which examines the performance of the model when the Wav2Vec2 component is removed. It is important to note that Entries (1)-(2) are based on the Wav2Vec2 model, which relies on self-supervised learning, while Ent. (3) uses only a CNN-based model, which corresponds to supervised learning.
Neural networks employing the Softmax (Eq.(4)) activation layer that is exploited for AD problems as in this work, are known to produce results that are relative to the training and test sets distribution. Yet, whenever it is possible, an effective AI framework has to be able to generalize in front of Out Of Distribution (OOD) [98], [99] cases, by flagging the ones that are beyond their capacity, as well as request human intervention. In the world of Anomaly Detection, the concept of OOD can be manifested in problems such as binary classification, or even one-class classifier [100]. One of the acceptable approaches to transforming IC into an OOD detection problem is adding an 'unknown' class to a classification model. However, this procedure requires apriori tagged OOD data for training, which is an unbound amount of data in theory -a difficult problem whenever the dataset to train is (i) limited, and (ii) bounded by the data collection process time-frame, i.e. the time and conditions of the data collection. Thus, when designing an architecture for a classification problem, one of the penetration-test that should be considered before detecting anomalies has to test OOD cases that might have been recorded in different time-frames and conditions.
As such, and in order to demonstrate the robustness of our Transformer-based approach, we recorded an additional dataset using the same drone and the same recording set, but in a different environment from the one described in Section III. That is, the new dataset was recorded while songs were being played very close to the drone during its flight. Clearly, this is an OOD situation, since the initial dataset did not consider such a scenario at all. These audio recordings contain 300 additional test samples of length 1 second, such that 11% of the test samples represent the Anomaly class and 89% of the audio recordings are considered Normal. The ratio between the Anomaly class samples and the Normal class samples is approximately 1:10, which simulates the real-world AD problem where Anomalies appear rarely. Finally, both the Transformer-based approach suggested in this paper and the CNN-only approach (VGG-16) were tested with the same data pre-processing and inference processes as presented in Sections III and IV-D, with a Softmax threshold of 0.5 for the OOD computation.
Both models yield slightly lower results when tested on the new samples of the experiment, since these samples were recorded in a new environment that differs from the one on which the models were trained. One can note from Table 5 that the Transformer-based model yields better results compared to the CNN-based model. These results support our thesis and indicate that the Transformer-based model is more robust and accurate than the CNN-based model.

H. REAL-TIME & EMBEDDED EXPERIMENT
To determine the feasibility of the proposed model in real-time scenarios, inference with the model was deployed on a Raspberry Pi single-board computer. As mentioned above (Section IV), an earbud was placed on top of the Tello quadrotor. The main idea of this experiment is to test the capabilities of the Wav2Vec2 model in real-time mode, on mini-computers that are equipped with basic hardware. The earbud was connected via Bluetooth to the Raspberry Pi, and thus it generated real-time audio samples. Next, each audio sample of length 1 second was converted into a spectrogram and fed to the input layer of the Wav2Vec2 model, to get a classification of the audio sample as an Anomaly or as a Normal sample.
The entire experiment consisted of running the feed-forward function of the Wav2Vec2 and the CNN models on exactly 5 minutes of audio that was recorded in real time, which generated 300 audio samples online. Later, to test the real-time results, these audio samples were manually tagged sample-by-sample, i.e. second-by-second, such that each audio sample was tagged either as an Anomaly or as a Normal sample. These real-time audio samples contain exactly 300 test samples of length 1 second, such that 33 of them are of class Anomaly and the remaining 267 are of class Normal. Again, we encounter the imbalanced situation in which the ratio between the Anomaly samples and the Normal ones is ≈ 1:10.
The total processing time of the whole real-time experiment (without any optimizations) was 60 seconds, i.e. 0.2 seconds on average per one-second audio sample. That is, after recording each audio sample of length 1 second as a WAV file, it took 0.2 seconds on average to (i) turn the audio sample into a spectrogram; (ii) turn the spectrogram into an input matrix to feed the input layer of the Wav2Vec2 model; and (iii) apply the feed-forward function of both the Wav2Vec2 model and the CNN model, and get a classification for the audio sample (Anomaly or Normal). In order to allow a fully real-time solution, the overall processing time should be lower than the sampling time. The major runtime component in the suggested system is the use of Transformers.

FIGURE. Seeed Studio XIAO nRF52840 Sense: a ≈ 2 cm × 2 cm micro-controller that weighs less than 2 grams and is equipped with an IMU, a digital microphone, and a Bluetooth 5.0 communication module. Such a micro-controller can be used to run TinyML or TensorFlow Lite, and thus is a suitable candidate for implementing our sound analysis method.
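The real-time requirement stated above (processing time lower than sampling time) can be expressed as a real-time factor below 1. The following minimal sketch only restates the reported numbers (300 one-second samples processed in 60 seconds); the variable and function names are ours.

```python
SAMPLE_DURATION_S = 1.0     # length of each audio sample
NUM_SAMPLES = 300           # samples collected during the 5-minute experiment
TOTAL_PROCESSING_S = 60.0   # reported processing time for all samples


def real_time_factor(processing_s, audio_s):
    """Processing time divided by audio duration; below 1 means real-time."""
    return processing_s / audio_s


per_sample_s = TOTAL_PROCESSING_S / NUM_SAMPLES
rtf = real_time_factor(per_sample_s, SAMPLE_DURATION_S)
keeps_up = rtf < 1.0
print(round(per_sample_s, 3), round(rtf, 3), keeps_up)   # 0.2 0.2 True
```

A real-time factor of about 0.2 means that roughly 80% of each one-second window remains free for recording, transmission, or other onboard tasks.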
To use the pre-trained Wav2Vec2-Base transformer in a mobile environment, quantization might be a necessary step. Thus, the model was converted to a qint8 dynamic (i.e., weights-only) quantized model, a common solution for heavy models whose RAM requirements make them impractical for mobile devices. This operation shrank the Wav2Vec2-Base model from 360MB to 80MB, making it better suited to edge and mobile usage. Next, we tested our approach on two embedded platforms designed for real-time scenarios. As can be seen in Table 6, Ent. (1) presents the real-time experiment in which the Wav2Vec2 model in the Wav2BC+ framework is the Wav2Vec2-Base transformer, running on CPU-based computation. Since such a model is too heavy for mobile devices, we repeated the same experiment as presented in this section (Ent. (2)) on the embedded device. Because the lighter, quantized Wav2Vec2 model requires less RAM than the Base version, a slight degradation in the accuracy metrics could be expected, while the average processing time per audio sample remains the same, which makes the approach applicable to mobile devices as well.
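To illustrate what weights-only int8 quantization does to a tensor, the sketch below applies a simple symmetric per-tensor scheme in plain Python. This is an assumed, simplified scheme for illustration and not necessarily the one applied to Wav2Vec2-Base; in PyTorch, dynamic quantization of linear layers is exposed through `torch.quantization.quantize_dynamic` with `dtype=torch.qint8`, although the exact tooling used here is not stated in the text.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: each weight w is stored as an integer q
    in [-127, 127] such that w is approximately scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the int8 codes."""
    return [scale * v for v in q]


# Toy weights (illustrative values only).
weights = [0.5, -1.0, 0.25, 0.0, 0.731]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each quantized weight occupies one byte instead of the four bytes of a 32-bit float, and the rounding error per weight is bounded by half the scale, which is the source of the small accuracy degradation that quantization can introduce.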
Finally, one can note from Table 6 that the Transformer-based model preserved the level of results obtained in the last two experiments (Sections V-F and V-G). Moreover, the Wav2Vec2 method is suitable for real-time edge computing platforms such as the Raspberry Pi and can be adapted to run on even smaller System on Chip (SoC) platforms.

VI. CONCLUSION AND FUTURE WORK
In this paper, we presented Wav2BC+, a framework that detects anomalies in sound waves emitted from a UAV using deep-learning methods, and focused on the benefits of transfer learning for constructing an improved model for the anomaly detection problem in UAVs. We have shown that a Transformer-based model followed by a CNN achieves better results in detecting anomalies in UAVs from sound waves than the well-known VGG (CNN-based) approach over spectrograms. That is, we have developed a real-time approach that outperformed two baselines, in which our compressed version of the well-known VGG-16 architecture has far fewer parameters and still yields high accuracy in anomaly detection in UAVs. In terms of performance metrics, Wav2BC+ maintains high accuracy metrics in all of the experiments and reduces the number of parameters of the well-known VGG from 138M to a shrunk version of the VGG with only 3.6M parameters. Moreover, we applied our technique to an extremely small dataset, which is a problem in its own right due to the lack of information. In addition, the compressed CNN suggested in our approach can be applied on tiny devices that cannot cope with resource-intensive applications. For industrial purposes, one can deploy our transfer-learning framework on any kind of drone or UAV that is able to run such architectures.
Although better results were obtained, the anomaly detection (AD) problem is still not entirely addressed. Hence, one possible direction for future research would be addressing the AD problem from an external sound source as well, i.e., creating a dataset of sound waves emitted by a UAV and recorded from different distances and not only from its top. Another possible direction would be to classify anomalies by type, by training the model with a larger dataset containing different examples of anomalies labeled by their types. From the architectural standpoint, further future work could be the construction of actual acoustic sensors and analyzers for drones. Such devices may be implemented using a tiny micro-controller capable of running TinyML (as shown in Figure 16). The availability of such devices may help the community to construct a comprehensive dataset for a wide range of UAVs [18].
In addition, in terms of the neural networks' performance, more sophisticated deep-learning techniques can be of great utility, especially for the real-time scenario. Such techniques include depth-wise separable convolution [101], atrous spatial pyramid pooling [102], and attention mechanisms [103], [104], as well as improvements in the transformers themselves.