Anomaly Detection in Aerial Videos With Transformers

Unmanned aerial vehicles (UAVs) are widely applied in inspection, search, and rescue operations by virtue of their low-cost, large-coverage, real-time, and high-resolution data acquisition capacities. Massive volumes of aerial videos are produced in these processes, in which normal events often account for an overwhelming proportion. It is extremely difficult to manually localize and extract abnormal events containing potentially valuable information from long video streams. Therefore, we are dedicated to developing anomaly detection methods to solve this issue. In this article, we create a new dataset, named Drone-Anomaly, for anomaly detection in aerial videos. This dataset provides 37 training video sequences and 22 testing video sequences from seven different realistic scenes with various anomalous events. There are 87,488 color video frames (51,635 for training and 35,853 for testing) with a size of 640 × 640 pixels at 30 frames/s. Based on this dataset, we evaluate existing methods and offer a benchmark for this task. Furthermore, we present a new baseline model, anomaly detection with Transformers (ANDT), which treats consecutive video frames as a sequence of tubelets, utilizes a Transformer encoder to learn feature representations from the sequence, and leverages a decoder to predict the next frame. Our network models normality in the training phase and identifies an event with unpredictable temporal dynamics as an anomaly in the test phase. Moreover, to comprehensively evaluate the performance of our proposed method, we use not only our Drone-Anomaly dataset but also another dataset. We make our dataset and code publicly available (https://gitlab.lrz.de/ai4eo/reasoning/drone-anomaly, https://github.com/Jin-Pu/Drone-Anomaly). A demo video is available at https://youtu.be/ancczYryOBY.

Fig. 1. Abnormal and normal dynamics. We display some frames from the crossroads scene for demonstrating the importance of temporal information in detecting anomalous events in aerial videos. In the normal video clip (top), all vehicles have a consistent moving direction. We use a yellow box with an arrow to represent an example vehicle and its moving direction. In the abnormal video snippet (bottom), a vehicle (in the red box) moves backward on the road. We can see the importance of temporal context in this task.

I. INTRODUCTION
Anomaly detection refers to the detection of visual instances that significantly deviate from the majority [1]. Due to the expanding demand in broad domains, such as inspection [2]-[6], search operations [7], [8], and security [9]-[12], anomaly detection plays an increasingly important role in various communities, including computer vision, data mining, machine learning, and remote sensing. With the proliferation of UAVs worldwide, massive amounts of aerial videos spur the demand for detecting abnormal events in aerial video sequences in a wide range of applications [13]. For example, many long-endurance UAVs are developed and utilized in inspection operations [2]-[6]. Large amounts of aerial videos are created by these UAVs, in which normal video segments often account for an overwhelming proportion of the whole video. It is time-consuming and costly to manually find potentially valuable information in long and untrimmed videos. Therefore, we intend to adopt anomaly detection methods to automatically and temporally localize anomalous events in aerial videos.
Usually, we cannot know beforehand what the anomalies in a scene are, because there are too many possibilities to exhaustively list. By contrast, it is easy to have information on the nature of normality in advance. Hence, most existing methods for anomaly detection use only normal data to learn feature representations of normality and consider test instances that cannot be well described as anomalies. Numerous studies [14]-[26] are dedicated to detecting and categorizing non-conforming patterns present in images. These studies mainly focus on spatial occurrences of anomalous patterns. In contrast, anomaly detection in videos aims at identifying temporal occurrences (i.e., start and end times) of abnormal events. In computer vision, many methods [27]-[34] have been proposed for this task in surveillance videos. In comparison with surveillance videos, UAV videos bring the following challenges: (1) a moving camera instead of a static camera and (2) variable spatial resolution due to changes in flight altitude. Existing works [35]-[37] predefine several categories of anomalous events, convert aerial video anomaly detection into an event recognition task, and utilize supervised methods to address this problem. By contrast, in this work, we are interested in unsupervised methodologies for this task, because in many real-world applications it is not possible to exhaustively list all anomalous events beforehand. More specifically, we train a model for anomaly detection in aerial videos using only normal data, which can be collected easily in advance.
In this paper, we focus on detecting anomalous events in aerial videos. To this end, we create a new dataset, named Drone-Anomaly, providing 37 training video sequences and 22 testing video sequences from 7 different realistic scenes. The dataset contains real-world anomalous events that are not staged by actors. Based on this dataset, we evaluate existing methods and offer a benchmark. In addition, we note that modeling temporal context is critical (see Fig. 1). Most existing anomaly detection methods utilize convolution-based encoders to capture spatiotemporal dependencies among the input video frames. However, these encoders are limited in learning long-term relations due to their limited temporal receptive fields. In this paper, we present a new baseline model, ANomaly Detection with Transformers (ANDT), which takes several consecutive video frames as input, leverages a Transformer encoder to model global context, and utilizes a decoder to predict the next frame. More specifically, ANDT treats a video as a sequence of tubelets and maps them into tubelet embeddings by linear projection. To preserve spatiotemporal information, the tubelet embeddings are added with learnable spatiotemporal position embeddings and then fed into a Transformer encoder to learn a spatiotemporal feature. The decoder is subsequently combined with the encoder to predict the next frame based on the learned spatiotemporal representation. Our network is able to predict an event with normal temporal dynamics well and identifies an event with unpredictable temporal dynamics as an anomaly in the test phase.
The main contributions of this paper can be summarized as follows:
• We create an annotated dataset consisting of 37 training videos and 22 testing videos involving 7 realistic scenes, covering a large variety of anomalous events. This dataset expands the scope of anomaly detection research.
• We extensively validate existing methods in order to provide a benchmark for this task.
• We present a new baseline model, ANDT, which leverages a Transformer encoder to model long-term temporal context for anomaly detection in aerial videos.
The remainder of this paper is organized as follows. Section II reviews related work. Section III details our new dataset. Our network is described in Section IV. Section V shows and discusses experimental results. Finally, the paper is concluded in Section VI.

II. RELATED WORK
In remote sensing, there have been a number of works on anomaly detection in hyperspectral imagery [40]-[48]. These studies mainly focus on locating pixels with spectral signatures significantly different from their neighboring background pixels in the spatial domain. For example, the Reed-Xiaoli (RX) algorithm [40] uses a local Gaussian model to detect anomalies in hyperspectral images and has become a baseline model. In [41], a collaborative representation detector (CRD) is proposed to detect pixels with unknown spectral signatures. Recently, deep learning-based methods have drawn significant attention. [42] proposes to use an autoencoder to learn representative features in order to detect anomalies in an unsupervised manner. In [43], the authors employ convolutional neural networks (CNNs) to learn spectral-spatial features for this task and achieve outstanding performance.
From static imagery to multi-temporal images, much effort [49]-[56] has been made to detect anomalies in the temporal domain. For instance, [49] uses multispectral images over two years for locating and identifying crop anomalies in two soybean fields. In [50], the authors leverage multitemporal thermal infrared (TIR) images for detecting geothermal anomaly areas by spatiotemporal analysis. In [51], multitemporal Landsat images are utilized to detect NDVI anomalies for mapping incongruous patches in coffee plantations.
Moreover, we note that in computer vision, many anomaly detection approaches [57]-[63] have been developed for fixed-camera surveillance videos. By contrast, anomaly detection in aerial videos is more challenging because the videos are usually acquired by moving cameras. There have been a few works investigating anomaly detection in aerial videos. These works [35]-[37] regard this problem as an event recognition task. Specifically, they first predefine several anomalous activities and then leverage supervised methods to recognize the defined events in aerial videos. For example, [35] leverages object tracking and classification methods to obtain trajectories and semantic information and then utilizes an ontology-based reasoning model to learn spatiotemporal relations among them for detecting video events. In [36], the authors define three different safety-related anomalies and propose a functional approach that models temporal relations of time-to-collision safety indicators to detect these anomalies in UAV-based traffic videos. Furthermore, [37] proposes a hybrid approach that integrates trajectories and semantic information of objects to build high-level knowledge for extracting complicated critical activities and events from UAV videos. Most recently, based on the AU-AIR dataset [64], which was proposed for object detection in UAV videos, [39] builds a dataset including several anomalous objects (hereafter called the AU-AIR-Anomaly dataset) and proposes a supervised method, CADNet, to detect instance and contextual anomalies in aerial videos. Compared to our dataset, the AU-AIR-Anomaly dataset contains only a single scene, i.e., traffic, and its aerial video has a relatively stable perspective.
In real-world applications, there are many possible anomalies in a scenario, which cannot be exhaustively listed and defined in advance. Instead, the nature of normality is relatively stable and easy to know beforehand. Therefore, we propose an unsupervised method, ANDT, that learns feature representations of generic normality from merely normal data and flags test data with large reconstruction errors as anomalies. Moreover, the methods in [35]-[37], [39], [65] all leverage convolution-based encoders to learn spatiotemporal dependencies among input video frames. Due to their limited temporal receptive fields, these models are unable to effectively capture long-term temporal relations. By contrast, our method ANDT adopts a Transformer-based encoder that confers a global temporal receptive field on our model and enables it to capture temporal dependencies among all input frames. With this global perspective, our model is adept at distinguishing the movement of instances from the dynamic background and provides rich contextual information for detecting anomalies.

III. DATASET
To address the lack of available datasets for anomaly detection in aerial videos, we present Drone-Anomaly. This section introduces the construction of our dataset, including video collection and annotation. Finally, we present the overall statistics of the dataset.

A. Video Collection
We collect aerial videos on YouTube and Pexels using search queries (e.g., drone highway, UAV roundabout) for each scene. To increase the diversity of anomalous events, we retrieve aerial videos using different languages (e.g., English, Chinese, German, and French). Moreover, to ensure the quality of the aerial videos, we remove videos with any of the following issues: too short a duration, manual editing, not captured by UAV cameras, or no clear anomalous events. We show four frames of an example video from each scene in Fig. 2.

B. Annotation
We assign video-level labels to the training data. In the test phase, frame-level annotations are needed to evaluate performance. Thus, we provide frame-level labels with binary values, where anomalous frames are labeled as 1 and normal frames as 0. For each scene, training videos and testing videos with anomalies are provided. The details are shown in Table II.

C. Statistics
Our Drone-Anomaly dataset consists of long, untrimmed aerial videos that cover 7 real-world scenes: highway, crossroads, bike roundabout, vehicle roundabout, railway inspection, solar panel inspection, and farmland inspection. The various anomalies in these scenes have important practical significance and applications. We provide an overview of our dataset in Table II. Basically, the dataset consists of 37 training video sequences and 22 testing sequences, each at 30 frames per second with a spatial size of 640 × 640 pixels. There are a total of 87,488 color video frames (51,635 for training and 35,853 for testing).

D. Comparison with Related Datasets
We compare our dataset with related datasets in Table II. The Mini-drone dataset [38], consisting of 38 videos, was proposed for parsing video contents for privacy protection. The dataset contains three categories: normal, suspicious, and illicit behaviors. All events are staged by actors. This dataset can be used for different tasks, e.g., action recognition, video classification, event recognition, and event detection. In addition, based on the AU-AIR dataset [64], [39] annotates different anomalous events for detecting anomalies in aerial videos. The AU-AIR-Anomaly dataset contains four realistic anomalies, i.e., a car on a bike road, a person on a road, a parked van in front of a building, and a bicycle on a road.

IV. METHODOLOGY
In this section, we detail our model. First, we introduce future frame prediction, the framework we use for anomaly detection, in Section IV-A. Next, we give a detailed description of ANDT in Section IV-B.

A. Future Frame Prediction for Anomaly Detection
For anomaly detection in aerial videos, compared with the commonly used reconstruction-based framework [31], [66]-[74], where target values are equal to the inputs, it is more natural to predict the next video frame conditioned on several consecutive frames and compare the predicted frame with its ground truth. In this way, temporal context can be modeled.
The assumption of the future frame prediction framework is that temporal consistency is stably maintained in normal events; thus, normal events are temporally more predictable than anomalies. In the training stage, a network is trained with only normal videos to learn normal temporal patterns. In the test phase, events and activities that are not well predicted by the network are deemed anomalies. Formally, given a video $V$ composed of $T$ consecutive frames, $V = \{I_1, I_2, \ldots, I_T\}$, all frames are stacked temporally and then utilized to predict the next frame $I_{T+1}$. The predicted frame is denoted as $\hat{I}_{T+1}$. We aim to learn a mapping $P$ as follows:

$$\hat{I}_{T+1} = P(I_1, I_2, \ldots, I_T). \quad (1)$$

To make $\hat{I}_{T+1}$ closer to $I_{T+1}$, we minimize their $\ell_2$ distance in intensity space:

$$\mathcal{L} = \left\| \hat{I}_{T+1} - I_{T+1} \right\|_2^2. \quad (2)$$

In the test phase, the $\ell_2$ distance between the predicted next frame $\hat{I}_{T+1}$ and the true next frame $I_{T+1}$ is calculated to identify anomalies. Frames with relatively large $\ell_2$ distances are deemed anomalies.
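As a concrete illustration, the prediction objective and the test-time scoring above can be sketched in a few lines of Python. The "repeat the last frame" predictor below is a deliberately trivial stand-in for the learned mapping $P$ (ANDT replaces it with a Transformer-based network), and all array values are synthetic.

```python
import numpy as np

def l2_loss(pred: np.ndarray, target: np.ndarray) -> float:
    # Training objective (Eq. 2): squared l2 distance in intensity space.
    return float(((pred - target) ** 2).sum())

def anomaly_score(pred: np.ndarray, target: np.ndarray) -> float:
    # Test-time score: the larger the prediction error, the more
    # anomalous the frame is considered.
    return float(((pred - target) ** 2).mean())

rng = np.random.default_rng(0)
clip = rng.random((6, 64, 64, 3))    # conditioning frames I_1..I_6
true_next = rng.random((64, 64, 3))  # ground-truth next frame I_7

# Trivial stand-in for the mapping P in Eq. (1): predict "no change"
# by repeating the last observed frame.
pred_next = clip[-1]

loss = l2_loss(pred_next, true_next)
score = anomaly_score(pred_next, true_next)
```

At test time, frames whose score exceeds a threshold derived from the training errors (Section V-B) would be flagged as anomalous.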

B. Anomaly Detection with Transformers (ANDT)
We propose a method, ANDT, as the mapping $P$. The Transformer [75] was originally proposed for sequence-to-sequence tasks in natural language processing (NLP), such as language translation. Its main idea is to use self-attention, which enables the model to capture long-range dependencies in a whole sequence. We observe that a video is naturally a temporal sequence, but with spatial content. Therefore, we interpret a video as a sequence of tubelets and process them with a Transformer encoder in order to capture long-term spatiotemporal dependencies. A 3D convolutional decoder is then attached to predict the next frame based on the learned spatiotemporal relations. An overview of the model is depicted in Fig. 3.
Vision Transformer [76] performs tokenization by splitting an image into a sequence of small patches. In this work, since we deal with videos, we tokenize a video by extracting non-overlapping spatiotemporal tubes. Specifically, the input video $V \in \mathbb{R}^{T \times H \times W \times C}$ is split into tubelets of dimension $(t, h, w)$, where $(H, W)$ is the spatial size of the video frames, $C$ represents the number of channels, $T$ denotes the number of frames, $n_t = T/t$, $n_h = H/h$, and $n_w = W/w$. $N = n_t \cdot n_h \cdot n_w$ is the number of tokens. Then, we map the tubelets into a $K$-dimensional latent space by a trainable linear projection with weights $E \in \mathbb{R}^{(t \cdot h \cdot w \cdot C) \times K}$. By doing so, spatiotemporal information is preserved during tokenization.
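The tubelet tokenization described above can be sketched with plain NumPy reshapes. The video size, tube size, and embedding dimension below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

T, H, W, C = 6, 64, 64, 3    # frames, height, width, channels (assumed)
t, h, w, K = 2, 16, 16, 128  # tubelet size and embedding dim (assumed)

n_t, n_h, n_w = T // t, H // h, W // w
N = n_t * n_h * n_w          # number of tokens

rng = np.random.default_rng(0)
V = rng.random((T, H, W, C))  # input video

# Split V into non-overlapping (t, h, w, C) tubes and flatten each one.
tubes = (V.reshape(n_t, t, n_h, h, n_w, w, C)
          .transpose(0, 2, 4, 1, 3, 5, 6)
          .reshape(N, t * h * w * C))

# Trainable linear projection E maps each flattened tube to a K-dim
# embedding, yielding the (N, K) token sequence fed to the encoder.
E = rng.standard_normal((t * h * w * C, K)) * 0.02
tokens = tubes @ E
```

In a trained model, `E` would be a learned parameter; a `Conv3d` with kernel size and stride equal to the tube dimensions performs the same extract-and-project step in one operation.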
We also prepend a learnable embedding $x_{\mathrm{cls}}$ to the sequence of tubelet embeddings; its state at the output of the Transformer encoder serves as the spatiotemporal feature $p$. Furthermore, to inject spatiotemporal position information into our model, we add learnable spatiotemporal position embeddings $E_{\mathrm{pos}} \in \mathbb{R}^{(N+1) \times K}$ to the tubelet embeddings:

$$z_0 = [x_{\mathrm{cls}}; x_1 E; x_2 E; \ldots; x_N E] + E_{\mathrm{pos}}. \quad (3)$$

$z_0$ is subsequently fed into Transformer encoder layers, each consisting of two sublayers. The first is a multi-head self-attention (MSA) mechanism, and the second is a simple multilayer perceptron (MLP). Layer normalization (LN) is applied before every sublayer, and residual connections are used around every sublayer. The Transformer encoder takes these embeddings as input and learns the spatiotemporal feature $p$ via

$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \ldots, L, \quad (4)$$
$$z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}, \quad \ell = 1, \ldots, L, \quad (5)$$
$$p = \mathrm{LN}(z_L^0), \quad (6)$$

where $L$ is the number of encoder layers and $z_L^0$ denotes the class-token state of the final layer. We leverage a convolutional decoder to predict the next frame $\hat{I}_{T+1}$ based on the learned spatiotemporal feature $p$. First, we use two fully connected layers to increase the dimension of $p$ and then reshape it into a 3D tensor of size $8 \times 8 \times 512$. This size is associated with the number of convolutional layers in the decoder. Considering both computational complexity and reconstruction accuracy, we use a decoder with five convolutional layers and upsampling layers. It progressively reconstructs the next frame, of size $256 \times 256 \times 3$, from the encoded feature tensor of $8 \times 8 \times 512$. In particular, we adopt a progressive upsampling strategy that alternates upsampling layers and convolution layers, with the upsampling rate restricted to $2\times$. Batch normalization and ReLU are applied after each convolution layer. This strategy enables our decoder to learn spatial dependencies and upsample the learned features progressively, which leads to better reconstruction of details and boundaries.
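A minimal PyTorch sketch of this progressive-upsampling decoder is given below. The intermediate channel widths (256 down to 16) are our own illustrative choices; the text only fixes the input tensor (8 × 8 × 512), the output frame (256 × 256 × 3), the 2× upsampling rate, and the conv + BatchNorm + ReLU pattern.

```python
import torch
import torch.nn as nn

def up_block(c_in: int, c_out: int) -> nn.Sequential:
    # One progressive step: upsample by 2x, then refine spatial detail
    # with a 3x3 convolution followed by BatchNorm and ReLU.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    up_block(512, 256),  #   8x8  ->  16x16
    up_block(256, 128),  #  16x16 ->  32x32
    up_block(128, 64),   #  32x32 ->  64x64
    up_block(64, 32),    #  64x64 -> 128x128
    up_block(32, 16),    # 128x128 -> 256x256
    nn.Conv2d(16, 3, kernel_size=3, padding=1),  # project to RGB frame
)

p = torch.rand(1, 512, 8, 8)  # reshaped spatiotemporal feature
frame = decoder(p)            # predicted next frame, (1, 3, 256, 256)
```

Five 2× stages take the spatial size from 8 to 256 (8 · 2⁵ = 256), matching the description above.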

V. EXPERIMENTS
In this section, we present our experimental results. In Section V-A, we introduce the datasets used in the experiments. Evaluation metrics are introduced in Section V-B, and the competitors are described in Section V-C. Next, several ablation studies are conducted to investigate the effectiveness of our method, and we report their results in Section V-D. Moreover, in Section V-E, we provide a benchmark on the Drone-Anomaly dataset for anomaly detection in aerial videos by extensively validating existing methods, and we compare our method with these baseline models. In Section V-F, we assess the performance of our method on the AU-AIR-Anomaly dataset and compare it with other competitors. Finally, we visualize the learned features of our method in Section V-G.

A. Dataset
To evaluate the performance of our method, we use not only our Drone-Anomaly dataset but also the AU-AIR-Anomaly dataset [39]. Statistics of the two datasets can be found in Table II.
1 Reconstruction-1 is the strategy of inputting 1 frame and reconstructing it. 2 Reconstruction-6 is the strategy of inputting 6 consecutive frames and reconstructing them. 3 Prediction-1 is the strategy of inputting 6 consecutive frames and predicting the next frame.

B. Evaluation Metrics
The receiver operating characteristic (ROC) curve is a popular evaluation tool in anomaly detection, and it is calculated by gradually changing the threshold. In addition, we use the area under the ROC curve (AUC) for performance evaluation. We leverage a strategy to determine the threshold used to calculate recall, precision, F1 score, and overall accuracy (OA). Specifically, we feed the training set into the trained model to obtain reconstruction errors for all training samples. The threshold is determined as the sum of the mean and the standard deviation of these reconstruction errors. We note that AUC is the primary metric, as it comprehensively evaluates the performance of a method.
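The mean-plus-standard-deviation threshold rule and the derived metrics can be sketched as follows. The error values and labels are synthetic examples for illustration, not measurements from the paper.

```python
import numpy as np

# Reconstruction errors on the (all-normal) training set determine the
# threshold: mean + one standard deviation.
train_errors = np.array([0.10, 0.12, 0.11, 0.09, 0.13])
threshold = train_errors.mean() + train_errors.std()

# Test-time errors and frame-level ground truth (1 = anomalous).
test_errors = np.array([0.10, 0.30, 0.12, 0.40])
labels = np.array([0, 1, 0, 1])

# A frame is flagged anomalous when its error exceeds the threshold.
pred = (test_errors > threshold).astype(int)

tp = int(((pred == 1) & (labels == 1)).sum())
fp = int(((pred == 1) & (labels == 0)).sum())
fn = int(((pred == 0) & (labels == 1)).sum())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
oa = float((pred == labels).mean())
```

AUC, by contrast, sweeps the threshold over all values and is therefore independent of this particular choice.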

C. Competitors
We compare our network with several state-of-the-art detection models.
• CAE [67]: CAE (convolutional autoencoder) leverages a convolutional encoder to map the input frames into a latent space to learn features. A convolutional decoder is then employed to reconstruct a frame based on the learned features. Its reconstruction error is used for detecting anomalies.
• CVAE [69]: CVAE (convolutional variational autoencoder) introduces a regularization into the representation space. It utilizes a prior distribution over the latent space to encode normal instances. This prevents overfitting and enables the generation of meaningful frames for anomaly detection.
• SSPCAB [34]: SSPCAB (self-supervised predictive convolutional attentive block) uses a convolutional layer with dilated filters, where the center area of the receptive field is masked. The block learns to reconstruct the masked area using contextual information. It can be incorporated into various existing models. In this paper, we equip it on the MNAD [74] model; the resulting model is still denoted SSPCAB.

D. Ablation Studies
We present a series of ablations to evaluate the effectiveness of our model. All of them are conducted on the highway scene, which has the largest number of training and test frames.
Model design. In the course of experiments, we find that the design of the Transformer encoder matters. Hence, we investigate different configurations to determine the optimal settings. Concretely, the following hyperparameters are taken into account: patch size, number of Transformer layers, number of attention heads, and MLP size. From Table IIIa, it can be observed that the model with a patch size of 16 × 16 achieves the best comprehensive performance. The patch size is associated with the extent to which the model exploits information within patches and spatiotemporal relations among patches. In Tables IIIb and IIIc, we focus on self-attention and find that using 2 Transformer layers and 6 attention heads exhibits superior performance. Multi-head self-attention enables the model to integrate temporal information from different representation subspaces, and the small number of Transformer layers ensures relatively low computational complexity. Finally, the MLP size determines the size of the output spatiotemporal feature of the Transformer encoder. In Table IIId, we can see that an MLP with a size of 4096 brings good results to our model, which could be attributed to the improved information capacity of the spatiotemporal feature.
Prediction vs. Reconstruction. In our network, future frame prediction is an important strategy for learning temporal dependencies to effectively detect anomalies. To evaluate how it affects the performance, we compare our prediction-based framework with a commonly used reconstruction-based methodology [31], [66]-[74]. We first report the results of these models in the five evaluation metrics. Then, we calculate the difference between the average anomaly score of normal frames and that of abnormal frames, denoted by ∆s. A network with a relatively large ∆s is more capable of distinguishing abnormal frames from normal frames. All results are shown in Table IV. It can be seen that the prediction-based framework achieves better results in AUC, Recall, F1 score, OA, and ∆s.
Number of input frames. We further investigate how the number of input frames affects the performance of our method. We evaluate the performance of ANDT with a varying number of input frames. The results are reported in Table V. We can see that the method with 6 input frames exhibits superior comprehensive performance. The performance of our model gradually improves as the number of input frames goes from 2 to 6 and then degrades with more input frames. This observation demonstrates that a few frames are not enough for modeling temporal context, while too many input frames deteriorate performance.

E. Results on the Drone-Anomaly Dataset
We evaluate various baseline models on all scenes in our Drone-Anomaly dataset with standard evaluation protocols and offer a benchmark. The results are reported in Tables VI and VII. We also compare the proposed model with other competitors.
Highway. This scene presents various kinds of anomalous events, e.g., a cow herd walking on the street, an accidental car collision, and a road section covered by sand and dust. These different anomalous events make this scenario very challenging. Compared with other competitors, our method achieves the best results in AUC (68.7%) and Recall (68.4%). The main competitor in this scene is MemAE, which also exhibits very good results in some metrics; however, its AUC is relatively low. Our method demonstrates superior overall performance in this scene.
Crossroads. This scene focuses on distinguishing various anomalous behaviors of vehicles and persons, such as persons crossing the road irregularly and vehicles moving backward. In this scene, capturing the temporal dynamics of persons and vehicles on the road is critical for identifying their anomalous behaviors. From the results reported in Table VI, our method achieves the best results in AUC (65.2%), Precision (66.3%), F1 score (64.6%), and OA (65.8%). This is mainly because the Transformer encoder of our approach is able to effectively model long-term temporal relations for distinguishing anomalous moving directions of persons or vehicles. We visualize the prediction of our method on a video clip of this scenario in Fig. 4 (see the third row), in which the anomalous event is a person crossing the road against the rules. We can observe that the traffic is hindered by the person crossing the road irregularly. In this case, dynamically sensing traffic speed is crucial for the successful detection of anomalous events. The numerical results demonstrate the effectiveness of our model. To evaluate the performance of detecting different kinds of anomalous events, we group anomalies into two categories: person-related anomalies and vehicle-related anomalies. The AUC results for each anomalous event are reported in Table VIII. Compared to other methods, our approach achieves the best AUC results for both kinds of anomalies.
Bike roundabout. Only one type of anomaly, i.e., a moving vehicle on the bike roundabout, is present in this scene. However, more than one abnormal event may be present in the test video sequence. This scenario can verify whether a method is able to continuously detect all anomalous events in a test sequence. Our method exhibits superior performance. We also observe that memory-based methods perform poorly. The reason may be that some feature representations of abnormal video frames misidentified as normal are memorized in the memory space, which deteriorates the performance of these models in recognizing subsequent anomalous frames.
Farmland inspection. One type of anomaly, i.e., unidentified vehicles, exists in this scene. Searching for anomalous objects is the goal in this scene. From the experimental results, our network achieves the best accuracy in AUC (79.5%) and exhibits superior performance in searching for anomalous objects.
In summary, our model exhibits superior performance in multiple scenes, including highway, crossroads, bike roundabout, and farmland inspection, in which many anomalous events with temporal dynamics exist. Specifically, in the highway scene, our method detects different anomalies better than the memory-based methods, i.e., MemAE and MNAD, which are specially designed to deal with various anomalies. This is because the global temporal receptive field enables our model to learn discriminative temporal representations of normality, which are used to effectively detect different anomalies.

F. Results on the AU-AIR-Anomaly Dataset
Further, we use the AU-AIR-Anomaly dataset [39] to validate the performance of our approach and the other methods. Due to the non-availability of public ground-truth labels for anomalies in the AU-AIR-Anomaly dataset, following [39], we label four anomalous events: a car on a bike road, a person on a road, a parked van in front of a building, and a bicycle on a road. We report numerical results in Table IX. As we can see, our model performs superbly and achieves the best accuracies in AUC (86.7%), Precision (84.9%), and F1 score (82.7%). The scene of this dataset is highly similar to the crossroads scene in our Drone-Anomaly dataset. Our network still exhibits stable and superior performance, which demonstrates its good generalization ability across different datasets.

G. Visualization of the Learned Features
We visualize in Fig. 5 the distribution of the Transformer features of some randomly chosen test samples on the crossroads scene in the Drone-Anomaly dataset. We leverage principal component analysis (PCA) to reduce the dimension of the features to 3. From the figure, it can be seen that normal instances (blue points) are concentrated in a relatively small area, while abnormal samples lie far away from this cluster.

H. Discussion
To verify whether our method raises too many false alarms in practical applications that do not contain any anomalies, we collect a new scene, i.e., river inspection, which does not contain anomalous events. We use a DJI drone to inspect a normal river and collect an aerial video for this validation. We show four sample frames of the test data in Fig. 6. We report the mean squared reconstruction error (MSRE) on the training data and the test data: MSRE_tra = 0.076 and MSRE_test = 0.078. These two values are very close. Besides, the false positive rate, FPR = 0.0041, is very low. These results indicate that our model also works well in scenes without any anomalies.
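This false-alarm check can be sketched as follows. The error distributions are synthetic stand-ins for per-frame squared reconstruction errors, chosen only to mirror the reported magnitudes, not the paper's actual measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-frame squared reconstruction errors for a scene with no
# anomalies (values roughly mirror the magnitudes reported in the text).
train_err = rng.normal(loc=0.076, scale=0.01, size=1000).clip(min=0.0)
test_err = rng.normal(loc=0.078, scale=0.01, size=1000).clip(min=0.0)

# Mean squared reconstruction error on each split; close values suggest
# the model generalizes to unseen normal data.
msre_train = float(train_err.mean())
msre_test = float(test_err.mean())

# Mean + std threshold from the training errors (Section V-B). Since the
# test scene contains no anomalies, every frame above the threshold is a
# false alarm, so the exceedance rate is the false positive rate.
threshold = float(train_err.mean() + train_err.std())
fpr = float((test_err > threshold).mean())
```

A low `fpr` under this protocol indicates that the model rarely raises alarms on purely normal footage.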

VI. CONCLUSION
In this paper, we focus on detecting anomalous events in aerial videos. To this end, we create a new dataset, termed Drone-Anomaly, providing 37 training video sequences and 22 testing video sequences that cover 7 real-world scenes with various anomalous events. Based on this dataset, we offer a benchmark for this task. Moreover, we present a new baseline model, ANDT, which treats a video as a sequence of tubelets and leverages a Transformer encoder to learn a spatiotemporal feature. A decoder is then combined with the encoder to predict the next frame based on the learned spatiotemporal representation. We conduct extensive ablation studies to validate the effectiveness of our network and compare our model with other baselines. The experimental results demonstrate its outstanding performance. In the future, we will focus on spatiotemporally detecting anomalous events in aerial videos.

Fig. 2. Overview of the Drone-Anomaly dataset. We show four frames of each video. The anomalous frames are marked with red borders, and frames with green borders are normal.

Fig. 3. Overview of ANDT. Our method treats a video as a sequence of tubelets and maps them into tubelet embeddings by linear projection. To preserve spatiotemporal information, the tubelet embeddings are added with learnable spatiotemporal position embeddings and then fed into a Transformer encoder to learn a spatiotemporal feature. The decoder is subsequently combined with the encoder to predict the next frame based on the learned spatiotemporal representation.

Fig. 4. Visualization of anomaly detection results of our method and a main competitor. We show frame-level anomaly scores (orange curves indicate ANDT, and blue curves denote MemAE). Ten frames of each video are shown, and anomalous frames are marked with red borders. Red rectangles indicate ground-truth anomalous regions. A demo video is available at https://youtu.be/ancczYryOBY.

Fig. 6. Four sample frames of the test data collected in the river inspection scene, which contains no anomalous events.
Fig. 5. Visualization of the feature distribution. We visualize the distribution of the learned spatiotemporal features from the Transformer encoder on the highway scene. The features of normal frames are represented by blue points, and those of anomalous frames by red points.

TABLE I
DATASET DETAILS. WE PROVIDE VARIOUS DETAILS OF THE DRONE-ANOMALY DATASET.
* The AU-AIR dataset is originally created for object detection tasks.

TABLE III
ABLATIONS ON THE ANDT DESIGN. WE SHOW AUC, F1 SCORE, AND OVERALL ACCURACY (OA) OF SEVERAL TRANSFORMER DESIGNS WITH DIFFERENT CONFIGURATIONS. THE BEST ACCURACIES ARE SHOWN IN BOLD.

TABLE V
NUMBER OF INPUT FRAMES. WE REPORT THE PERFORMANCE OF OUR MODEL WITH A VARYING NUMBER OF INPUT FRAMES. THE BEST ACCURACIES ARE SHOWN IN BOLD.
• adVAE [70]: adVAE (self-adversarial variational autoencoder) assumes that both anomalous and normal prior distributions are Gaussian. It utilizes a self-adversarial mechanism that adds discrimination training objectives to the encoder and decoder.
• MKD [33]: MKD (multiresolution knowledge distillation for anomaly detection) proposes to distill the knowledge of a pre-trained expert network into a more compact network that concentrates solely on discriminative features helpful in distinguishing normality from anomaly.

TABLE VI
COMPARING OUR APPROACH AGAINST OTHER METHODS. WE COMPARE OUR ANDT WITH OTHER COMPETITORS ON THE HIGHWAY, CROSSROADS, BIKE ROUNDABOUT, AND VEHICLE ROUNDABOUT SCENES. THE BEST ACCURACIES ARE SHOWN IN BOLD.

TABLE VII
COMPARING OUR APPROACH AGAINST OTHER METHODS. WE COMPARE OUR ANDT WITH OTHER COMPETITORS ON THE RAILWAY INSPECTION, SOLAR PANEL INSPECTION, AND FARMLAND INSPECTION SCENES. THE BEST ACCURACIES ARE SHOWN IN BOLD.

TABLE VIII
AUC RESULTS OF DIFFERENT KINDS OF ANOMALIES IN CROSSROADS. WE OFFER AUC RESULTS FOR TWO KINDS OF ANOMALIES IN CROSSROADS. THE BEST ACCURACIES ARE SHOWN IN BOLD.

TABLE IX
COMPARING OUR APPROACH AGAINST OTHER METHODS ON THE AU-AIR-ANOMALY DATASET. WE COMPARE OUR ANDT WITH OTHER COMPETITORS ON THE AU-AIR-ANOMALY DATASET. THE BEST ACCURACIES ARE SHOWN IN BOLD.
Vehicle roundabout. Various anomalous events, such as traffic congestion and people crossing the road irregularly, are present in this scene. Memory-based and GAN-based methods, namely Skip-GAN, MemAE, and MNAD, show superior performance in this scene. Our model suffers from insufficient training data and performs relatively poorly.
Railway inspection. This scene presents only one kind of anomaly, i.e., obstacles on the railway. Determining the existence of obstacles on the railway is vital in practical applications. From the results in Table VII, there is no dominant method. The reason might be that the insufficient training data (only 400 frames are available for training) cannot ensure that these models learn strong feature representations of normality.
Solar panel inspection. Two anomalies, unknown objects/animals and panel defects, appear in this scene. Our model achieves the best accuracies in Precision (66.0%) and OA (60.8%) and provides relatively satisfactory results in this scenario.