Skeletal Keypoint-Based Transformer Model for Human Action Recognition in Aerial Videos

Several efforts have been made to develop effective and robust vision-based solutions for human action recognition in aerial videos. Generally, the existing methods rely on the extraction of either spatial features (patch-based methods) or skeletal keypoints (pose-based methods) that are fed to a classifier. Unlike patch-based methods, pose-based methods are generally regarded as more robust to background changes and computationally efficient. Moreover, at the classification stage, the use of deep networks has generated significant interest within the community; however, the need remains to develop accurate and computationally efficient deep learning-based solutions. To this end, this paper proposes a lightweight Transformer network-based method for human action recognition in aerial videos using skeletal keypoints extracted with YOLOv8. The effectiveness of the proposed method is shown on a well-known public dataset containing 13 action classes, achieving very encouraging performance in terms of accuracy and computational cost as compared to several existing related methods.

I. INTRODUCTION
Traditional methods for action recognition used various sensing modalities, including accelerometers, magnetometers, and gyroscopes, to capture body movements, frequency of motion, angles and orientation of body parts, velocity, and acceleration, along with other advanced features [8], [9], [10], [11]. Although these methods are computationally efficient, robust to noise and illumination changes, and easily implementable, they are limited in terms of scalability, accuracy, and adaptability as compared to computer vision-based methods. With the availability of large image datasets, computer vision has become the trending choice for action recognition [12], [13], [14].
Specifically, vision-based action recognition methods are classified into two main types: patch-based and pose-based. Patch-based methods generally rely on the extraction of spatial features at the frame level, which are further processed to extract temporal dependencies across the video sequence [12], [13], [14]. A limitation of patch-based approaches is that they generally carry a higher computational cost associated with feature extraction. Pose-based methods instead rely on 2D skeleton data, which provides an outline of the human body joints without involving scene context [15], [16], [17]. These methods are generally considered more robust to background changes and can inherently better represent bodily movements than patch-based methods. Additionally, recent advancements in pose estimation techniques [18], [19], [20] have made it easier to obtain accurate locations of human joints, even when they are difficult to distinguish or partially occluded. Moreover, processing skeleton data requires fewer computational resources and less training time than patch-based methods.
Indeed, there has been considerable interest in the research community in employing deep learning-based models for human action recognition using pose information. Some methods built on Transformer-based models [21], [22] have been proposed to solve the problem. Other approaches [23], [24] relied on Graph Convolutional Networks (GCNs) for extracting temporal dependencies and demonstrated encouraging performance. However, these methods [21], [22], [23], [24] assume fixed camera settings and may not be directly applicable to aerial videos (with top-downish views), both because of significant viewpoint changes and because the movement of UAVs can cause motion blur.
To this end, this paper proposes an efficient and lightweight deep learning-based model for human action recognition in aerial videos. The proposed method adopts a two-stage approach. The first stage extracts skeletal keypoints using the YOLOv8 pose extractor. In the second stage, the extracted keypoints are fed to a Transformer network, which is trained on a wide variety of action types. Indeed, the use of a Transformer-based model with skeletal keypoints for aerial videos has been largely unexplored. We evaluated the proposed method on a well-known public dataset containing a wide variety of action types and assessed both its performance and computational complexity, with encouraging results as compared to several existing related methods.
The specific contributions of this work are listed below:
1. An efficient and lightweight Transformer-based model is presented for vision-based human action recognition in aerial videos.
2. A two-stage method is adopted in which the first step extracts 2D skeletal keypoints using YOLOv8 and the second step trains and tests a Transformer network on varying action types. Indeed, the use of a Transformer network with skeletal keypoints for action recognition in aerial videos is not well explored.
3. The effectiveness and efficiency of the proposed method are demonstrated on a public dataset containing a variety of action types, with superior performance compared to existing related methods.

II. RELATED WORK
There exist several methods based on traditional machine learning approaches with manual feature crafting for action recognition; however, they face a trade-off between performance accuracy and computational cost. For example, in [25] the authors extracted skeletal keypoints using a Kinect sensor and then used Hidden Markov Models to find the temporal relations for action recognition. The authors in [26] utilized optical flow for the extraction of motion features, which were then fed to an SVM to perform classification. Ohn-Bar and Trivedi [27] used skeletal data with Histogram of Oriented Gradients (HOG) feature description to classify various action types with an SVM. With the advancements in deep learning and the availability of large datasets, most traditional approaches to action recognition have become less desirable. In [28] and [29], the authors employed two-stream networks that used 2D CNNs on individual frames followed by a 1D module to aggregate the per-frame features. These methods, although effective, are limited in their ability to encode temporal information due to the use of 2D CNNs. Alternatively, the authors of [30] jointly modeled spatial and temporal information by using 3D CNNs. Other modifications of 3D CNNs, such as inflating 2D convolution kernels [31] or decomposing 3D convolution kernels [32], were proposed to improve performance. Sultani and Shah [33] utilized a disjoint multi-task learning approach based on 3D CNNs to address action recognition when only a small dataset is available. They used game data from GTA and FIFA along with GAN-generated aerial data derived from actual ground data for training, and then tested the model on real aerial data. Kotecha et al. [34] designed a Faster Motion Feature Modeling (FMFM) based system with Accurate Action Recognition (AAR) modeling; their system used a cascade of CNN-based models for both FMFM and AAR. Mliki et al. [35] developed a CNN-based algorithm that used AlexNet [36] for detection and GoogLeNet [37] for activity classification with ten classes. The authors in [38] proposed a model that used VGG16 [39] for CNN-based feature extraction from color and optical flow images and a Lattice LSTM for classifying temporal dependencies. Wang et al. [40] introduced an action recognition framework named Temporal Segment Network (TSN), which divides videos into equal-length segments. A sequence of snippets, which can be of variable length, is then created from these segments, and a consensus function aggregates the outputs of all the snippets to form the final class hypothesis. In [41], the authors designed an onboard UAV model for ten different gestures, which used YOLOv3-tiny for human detection, OpenPose [18] for pose estimation, and a DNN for gesture classification; they used their own data for training and evaluation. In [42], Ahmad et al. used YOLOv5 for object detection in frames and Stochastic Gradient Boosting for action classification with 12 different action types [43]. Ding et al. [44] proposed a lightweight model for action recognition in aerial videos; they employed a TCN-based method, which used MobileNetV3 as the feature descriptor and an attention module for finding temporal relations among the frames. The authors in [45] presented an approach to action recognition using semi-supervised and unsupervised domain adaptation. Srivastava et al. [46] proposed a system for violent action detection using Part Affinity Fields [18] for pose estimation and an SVM with an RBF kernel for classification; they also created their own data for training and evaluation. Most of the above-mentioned methods used CNN models for the extraction of spatial features and, in some cases, temporal features as well, and generally have a higher computational complexity and cost, requiring powerful GPUs. This makes them less deployable in real-world applications involving aerial camera settings. Moreover, the use of Transformer networks [47] is growing with encouraging performance in several vision tasks [48], but remains relatively less explored for the human action recognition problem.

III. PROPOSED TRANSFORMER-BASED ACTION RECOGNITION METHOD
The proposed method uses skeletal body keypoints for pose estimation, extracted using YOLOv8. These keypoints are preprocessed to make them compatible with the Transformer network for training and testing. The use of a Transformer-based network is inspired by an earlier work [22] that was aimed at a ground-based fixed camera setting. The proposed method introduces architectural changes, including data augmentation and the removal of dropout layers, to adapt it for the application at hand.

A. POSE ESTIMATION
We employed the YOLOv8 pose extractor, which provides 17 keypoints of the whole body. Compared to other pose extractors (OpenPose [18], YOLOv7 [49], EfficientPose [50]), we practically observed that the YOLOv8 pose extractor is faster and more accurate. Figure 1 illustrates the extracted keypoints on a sample image in which a person is performing a kicking action. Each input video to the pose extractor has the form (T, H, W, C), where T is the number of frames, and H, W, and C are the height, width, and number of channels of the video. The pose extractor returns the output in the form (T, P) for each video, where P represents the extracted keypoints. After the keypoints have been extracted, they are preprocessed to be fed to the Transformer model for training and classification.
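As an illustration, this stage can be implemented with the Ultralytics YOLOv8 pose API. The sketch below is a minimal example, not the authors' exact pipeline: the weight file ("yolov8n-pose.pt"), the use of normalized coordinates, keeping only the first detected person, and the non-overlapping sequence split are our assumptions; only the 17 keypoints per frame and the 30-frame sequence length follow the text.

```python
# Minimal sketch of the pose-extraction stage using the Ultralytics YOLOv8 pose model.
# Weight file, normalization choice and single-person assumption are illustrative only.
import cv2
import numpy as np
from ultralytics import YOLO

T = 30                                   # frames per sequence fed to the Transformer
model = YOLO("yolov8n-pose.pt")          # any YOLOv8 pose variant could be used

def extract_keypoints(video_path):
    """Return an array of shape (num_frames, 34): 17 (x, y) keypoints per frame."""
    cap = cv2.VideoCapture(video_path)
    frames_kpts = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]
        if result.keypoints is None or result.keypoints.xyn.shape[0] == 0:
            continue                                  # skip frames with no detected person
        kpts = result.keypoints.xyn[0].cpu().numpy()  # (17, 2), normalized to [0, 1]
        frames_kpts.append(kpts.reshape(-1))          # flatten to a 34-D vector
    cap.release()
    return np.asarray(frames_kpts)

def to_sequences(kpts, seq_len=T):
    """Split per-frame keypoints into non-overlapping sequences of length T."""
    n = len(kpts) // seq_len
    return kpts[: n * seq_len].reshape(n, seq_len, -1)
```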

B. TRANSFORMER NETWORK
The architecture of the Transformer encoder layer is shown in Figure 2. The encoder layer is repeated multiple times, depending upon the requirements and architecture. This model was originally developed for language processing to perform tasks like neural machine translation, and it is very efficient at keeping track of temporal dependencies in long sequences of data. The primary block responsible for memory, or temporal relations, is the self-attention block, which relates every instance to every other instance. Figure 3 shows the different steps of calculating self-attention. Q, K, and V are linearly transformed embedding vectors (or matrices if stacked) of the input instances. The matrix product of Q and the transpose of K is computed and then scaled, as shown in Figure 3. The scaled values are passed through a SoftMax layer, whose output is multiplied with the V matrix.

The pre-processed keypoints of each action are divided into $S_K$ sequences, where each sequence has the form (T, P); T is set to 30 in our case and P represents the keypoints as follows:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1j} \\ \vdots & \vdots & \ddots & \vdots \\ p_{i1} & p_{i2} & \cdots & p_{ij} \end{bmatrix}.$$

Here, i is the number of frames in a sequence and j is the number of keypoints in each frame. The Transformer model extracts the temporal features from the 30 consecutive frames of each sequence. The keypoints of the frames are first linearly transformed into an embedding matrix and added to a positional embedding matrix, which provides the positional information of each frame, creating $X_{emb}$. The dimension of $X_{emb}$ is $(T, d_{model})$, where $d_{model}$ is the embedding dimension of each vector (row). The positional embedding matrix has learnable parameters. $X_{emb}$ is then used to create the Q, K, and V matrices as shown below:

$$Q = X_{emb} W^Q, \quad K = X_{emb} W^K, \quad V = X_{emb} W^V,$$

where $W^Q$, $W^K$, and $W^V$ have learnable parameters and their dimensions are usually the same, i.e., $d_q = d_k = d_v$. The scaled dot-product attention of Figure 3 is then

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V.$$

This process of creating Q, K, V and attention is repeated h times, where h is the number of heads used in the model. The results of all the heads are concatenated, giving a matrix of dimension $(T, h d_v)$, and are transformed again by another layer through $W^O$, which has dimension $(h d_v, d_{model})$. This output is then passed to a feed-forward network, which transforms it by the following operation:

$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2,$$

where $W_1$ and $W_2$ have dimensions $(d_{model}, d_{ff})$ and $(d_{ff}, d_{model})$, respectively, and x is the output of the multi-head attention. We chose $d_{ff} = d_{model}$. This whole operation is illustrated in Figure 4, and the encoder layer is repeated multiple times.
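To make the data flow concrete, the following PyTorch sketch assembles a keypoint-sequence classifier along the lines described above. It is an illustrative assumption rather than the exact configuration of the paper (which is given in Table 1): the embedding size $d_{model} = 64$, four heads, the temporal average pooling, and the linear classification head are our choices, while the 30-frame sequences, 17 keypoints, 13 classes, $d_{ff} = d_{model}$, and the removal of dropout follow the text.

```python
# Minimal PyTorch sketch of a keypoint-sequence Transformer classifier (hypothetical
# KeypointTransformer); hyperparameters and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class KeypointTransformer(nn.Module):
    def __init__(self, num_keypoints=17, seq_len=30, d_model=64,
                 n_heads=4, n_layers=4, d_ff=64, num_classes=13):
        super().__init__()
        self.embed = nn.Linear(num_keypoints * 2, d_model)             # per-frame embedding
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learnable positions
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=0.0, batch_first=True)                             # dropout removed, as in the paper
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):                 # x: (batch, T=30, 34) flattened (x, y) keypoints
        h = self.embed(x) + self.pos_emb  # (batch, T, d_model)
        h = self.encoder(h)               # stacked self-attention + feed-forward layers
        return self.classifier(h.mean(dim=1))   # temporal average pooling, then class logits

# usage sketch
model = KeypointTransformer()
logits = model(torch.randn(8, 30, 34))    # 8 sequences of 30 frames, 17 (x, y) keypoints each
```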

IV. EXPERIMENTAL VALIDATION
This section presents the experimental validation and analysis of the proposed method, including the description of the dataset followed by the analysis of results.

A. DATASET
We used a well-known publicly available dataset for evaluation, the Drone-Action dataset [51]. This dataset contains 13 classes and a total of 240 high-resolution (1920 × 1080) videos at 25 frames per second. It was recorded in an outdoor environment with a camera mounted on a low-altitude, low-speed drone, and features 10 different actors to introduce a level of diversity. The dataset was collected on an unsettled road in the midst of a wheat field from varying top-downish viewpoints. The background wheat field can also pose a challenge (background clutter) to CNN-based feature extraction approaches. The dataset provides three different splits of training and test data, referred to as Split 1, Split 2, and Split 3. Figure 5 shows representative images of all action classes used in the evaluation.

B. RESULTS & ANALYSIS
We performed a detailed evaluation of the proposed method on the Drone-Action dataset. We experimented with varying numbers of Transformer encoder layers and report the results accordingly. We also performed data augmentation by flipping the frame keypoints horizontally (about the y-axis) to increase the number of training samples; a sketch of this step is given after this paragraph. Table 1 shows the hyperparameters of the Transformer model. Figures 6, 7, and 8 show the confusion matrices for the three splits with the four-layer encoder architecture, based on the experimental evidence given in Table 2 and discussed below in this section. It is clear from the figures that the proposed model architecture shows quite encouraging performance for all classes except 'Hit_Bottle', 'Hit_Stick', and 'Stab'. This is because, in each of these three classes, the actions performed appear quite similar, differing only in the object in hand, i.e., a bottle, a stick, or a knife. Since the pose estimator extracts keypoints of only the body joints, and not the objects being carried, these classes are difficult to distinguish. The model also confuses the actions 'Running_fb' and 'Jogging_fb', which appear quite alike.
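The exact augmentation code is not given in the paper; the sketch below shows one common way to mirror keypoint sequences horizontally, assuming COCO-ordered 17 keypoints (the order produced by YOLOv8) with x-coordinates normalized to [0, 1], and swapping left/right joint indices so the flipped pose stays anatomically consistent.

```python
# Sketch of horizontal-flip augmentation for keypoint sequences (assumed COCO order,
# normalized x in [0, 1]); not necessarily the authors' exact implementation.
import numpy as np

# COCO order: 0 nose, 1/2 eyes, 3/4 ears, 5/6 shoulders, 7/8 elbows,
# 9/10 wrists, 11/12 hips, 13/14 knees, 15/16 ankles (left listed first in each pair)
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16)]

def flip_sequence(seq):
    """seq: (T, 17, 2) normalized keypoints -> horizontally mirrored copy."""
    flipped = seq.copy()
    flipped[..., 0] = 1.0 - flipped[..., 0]          # mirror x about the vertical axis
    for left, right in FLIP_PAIRS:                   # swap left/right joint indices
        flipped[:, [left, right], :] = flipped[:, [right, left], :]
    return flipped

# usage: double the training set (sequences stored as (N, T, 17, 2) before flattening)
# train_x = np.concatenate([train_x, np.stack([flip_sequence(s) for s in train_x])])
```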
The performance obtained by the proposed framework is reported separately for each split using the standard evaluation measures: Precision, Recall, F1-score, and Accuracy (Table 2). The results show that the performance is generally encouraging, considering the number and diversity of action classes under consideration. A point to highlight is that the best performance is mostly obtained with four encoder layers (e.g., see the mean performance scores in Table 2). Overall, the best performance is generally obtained on Split 2.
Table 3 shows the comparison of the proposed method (in the form of the mean performance over all three splits) with the approaches reported in the original dataset paper [51], namely the High-Level Pose Features (HLPF) based method and Pose-based Convolutional Neural Networks (P-CNN), as well as a recent related method that used the YOLOv8 pose extractor in combination with a Long Short-Term Memory (LSTM) network [52] for action recognition. It is evident that the proposed method shows the best performance in terms of Precision, Recall, and F1-score, and a comparable performance in terms of Accuracy as compared to existing methods.
For a more holistic performance comparison, Table 4 compares the proposed model with several other deep learning models (3DResNet, ST-GCN, ResNet101, ResNet18, LSTM) in terms of accuracy and inference time per sequence of 30 frames. Here, we practically implemented all of these models on an Intel Core i3-5005U processor (two physical cores at 2.0 GHz) with 4 GB of RAM. The proposed model took approximately 7 minutes for one training run of 100 epochs. The results show that the proposed model outperforms the existing models both in terms of accuracy and inference time.
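For reference, the per-sequence inference time of such a model can be measured on the CPU as in the sketch below, which reuses the hypothetical KeypointTransformer defined in Section III; the warm-up pass and the number of timing runs are arbitrary choices, not the paper's protocol.

```python
# Sketch of measuring CPU inference time per 30-frame sequence, averaged over many runs.
import time
import torch

model = KeypointTransformer()              # hypothetical class from the Section III sketch
model.eval()
x = torch.randn(1, 30, 34)                 # one sequence of 30 frames
with torch.no_grad():
    model(x)                               # warm-up pass
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    elapsed = (time.perf_counter() - start) / runs
print(f"inference time per sequence: {elapsed * 1000:.2f} ms")
```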
In Table 5, we also compare the computational complexity of the proposed Transformer-based model with several related state-of-the-art models in terms of the number of network parameters (in millions) and the number of floating-point operations (FLOPs, in billions). It is evident that the proposed method performs better than all of the existing methods.
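The parameter count of the model can be obtained directly from PyTorch, as in the brief sketch below (again using the hypothetical KeypointTransformer); FLOPs can be estimated separately with a model-profiling tool, whose choice we leave open here.

```python
# Sketch of counting trainable parameters for the hypothetical KeypointTransformer above.
model = KeypointTransformer()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_params / 1e6:.3f} M")
```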

FIGURE 1. Top: Extracted keypoints of the whole body for the 'Kicking' action on a sample image. Bottom: Extracted keypoints shown in the form of a plot.

FIGURE 2. Key building blocks of the Transformer encoder layer.

FIGURE 3. Block diagram illustrating the different steps required for self-attention.

FIGURE 4. Data flow and the dimensions of matrices at each step of the encoder layer.

FIGURE 5. Sample images for each of the 13 action classes of the Drone-Action dataset, as used in the experimental evaluation.

TABLE 4. Comparison of complexity in terms of inference time as well as accuracy of the Transformer model with several existing models.

TABLE 5. Comparison of the computational complexity of the proposed Transformer-based method with recent state-of-the-art action recognition methods in terms of the number of parameters and the floating-point operations (FLOPs).

TABLE 1. Hyperparameters of the proposed Transformer model for training and testing.

TABLE 2. Performance analysis on each split with different numbers of encoder layers.

TABLE 3. Comparison of the mean performance of the proposed method with the existing methods.