A Comparison Between Various Human Detectors and CNN-Based Feature Extractors for Human Activity Recognition Via Aerial Captured Video Sequences

Human detection and activity recognition (HDAR) in videos plays an important role in various real-life applications. Recently, object detection methods such as “you only look once” (YOLO), faster region-based convolutional neural network (R-CNN), and EfficientDet have been used to detect humans in videos for subsequent decision-making applications. This paper addresses the problem of human detection in aerial video sequences captured by a moving camera attached to an aerial platform, with dynamical events such as varied altitudes, illumination changes, camera jitter, and variations in viewpoints, object sizes, and colors. Unlike traditional datasets, whose frames are captured by a static ground camera and show humans in medium or large regions, the UCF-ARG aerial dataset is more challenging because it contains videos with large distances between the humans in the frames and the camera. The performance of human detection methods described in the literature is often degraded when input video frames are distorted by noise, blur, illumination changes, and the like. To address these limitations, the object detection methods used in this study were trained on the COCO dataset and evaluated on the publicly available UCF-ARG dataset. The detectors were compared in terms of detection accuracy. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that EfficientDetD7 outperformed the other detectors, with 92.9% average accuracy in detecting all activities under various conditions including blurring, addition of Gaussian noise, lightening, and darkening. Additionally, deep pre-trained convolutional neural networks (CNNs) such as ResNet and EfficientNet were used to transfer learning from the ImageNet dataset to the UCF-ARG dataset and to extract highly informative features from the detected and cropped human patches.
The extracted spatial features were passed to a Long Short-Term Memory (LSTM) network to capture temporal relations between features for human activity recognition (HAR). Experimental results showed that EfficientNetB7-LSTM outperformed existing HAR methods in terms of average accuracy (80%), average precision (83%), average recall (80%), average F1 score (80%), average false negative rate (FNR) (20%), average false positive rate (FPR) (4.8%), and average Area Under Curve (AUC) (94%). The outcome is a robust HAR system that combines EfficientDetD7 for human detection with EfficientNetB7 and LSTM for activity classification.

The topic of human activity recognition (HAR) is not new but the challenges that accompany HAR have not been addressed completely. Several datasets have been used in HAR applications, but these datasets have not addressed challenges that accompany the UCF-ARG aerial dataset [110,111] that was used in this paper. Traditional datasets have frames that were captured by static ground cameras that show humans in medium-sized or large regions of these frames. Therefore, existing research works propose solutions to detect humans only when the size or scale of people in the frames is medium or large. We still need to study the robustness of existing human detection methods in aerial surveillance to localize humans that are small in scale, i.e., when the distance between the humans and the camera is large. Furthermore, the frames were captured by a moving (i.e., not fixed) aerial camera, such as that of a UAV.
Real-time detection of humans in aerial captured video sequences has certainly not been without challenges. For instance, the size of a human detected on aerial captured video can vary with the altitude of the UAV. The natural variation in the size of humans to be detected also poses a challenge to the technology [8]. Other problems that arise during the acquisition of aerial captured video include dynamical events such as changes in illumination and pronounced degrees of motion blur that result from camera jitter [35,36]. All these problems need to be addressed to develop a highly robust classification method that is able to distinguish humans from non-humans [8].
Convolutional neural networks (CNNs) have given machines the ability to use deep learning (DL) to detect objects of various sorts. Among the most widely used algorithms for these tasks are YOLOv5 [37], R-CNN [38], Fast R-CNN, Faster R-CNN, Mask R-CNN [39], R-FCN, SqueezeDet [40], EfficientDet [41], MobileNetV2 [42], RetinaNet [78], ShuffleNet [79], and PeleeNet [80]. It is essential that a detection algorithm yields favorable performance metrics at high inference speeds. Several works that describe human detection algorithms report such levels of performance using a variety of methods, such as body detection [43,44], head detection [45,46,47,48], shoulder detection [49,50], and many others [8,51,52,53,54,55]. However, in many instances, solely detecting features associated with human objects does not suffice, as human detection algorithms need to detect humans accurately despite obstructions within the field of view of the camera.
Advances in machine learning (ML) have changed the course of various domains over the past few decades. The use of neural networks in ML has led to considerable advancements in many applications, such as computer vision and natural language processing. As the use of ML for the detection of static objects has reached near-perfect levels of accuracy and precision, researchers have begun venturing into developing relatively newer methods to perform tasks of greater complexity. For instance, we have gone from the simple task of detecting humans in aerial captured videos to the objective of this paper, which is human activity recognition (HAR). The investigation of DL techniques for HAR enables us to extract considerably more meaningful information from digitized data that will significantly enhance various real-world functions, such as those earlier mentioned.
Methods for HAR belong to one of three main categories, namely (1) vision-based, (2) non-vision or sensor-based, and (3) multimodal [56]. Vision-based HAR methods make use of depth cameras to obtain color videos with depth information and acquire information on human movements for recognition [57]. However, these methods are highly susceptible to errors caused by variations in illumination from the environment and by a short range of detection. Non-vision or sensor-based HAR methods, on the other hand, utilize various sensors, such as wearable devices and ambient sensors, or combinations of these, to acquire information on human movements for purposes of recognition. Combining sensor types yields hybrid sensors that enrich the data features that can be collected. Doing so enables us to gather sensory information from the real environment, e.g., from cyber-physical-social systems [58]. Magnetic sensors built into smartphones can also readily obtain the position of their users [59]. However, relying on sensor-derived data alone may be challenging because hardware may prove to be costly and privacy concerns prevent large amounts of data from being made public. Moreover, sensor-derived data may need significant domain-specific expertise to obtain appropriate features that an ML model can process and learn from. Smartphone sensors produce a high-dimensional and noisy continuous sequence of observations. A combination of hierarchical and kernel extreme learning machine (HK-ELM) models was demonstrated to learn features and classify activities [81]. The authors utilized a feature fusion approach to combine hand-crafted features and H-ELM based learned features. Finally, multimodal HAR techniques allow us to make use of both vision-based and sensor-based data simultaneously to recognize human activity [82]. This is done so that one modality can provide complementary information in order to overcome the limitations of the other modality.
Raw data acquired from sensors and video data from cameras can now be automatically processed and learned by state-of-the-art DL techniques, specifically CNNs and recurrent neural networks (RNNs), which have seen tremendous improvements in performance over the years. For instance, [60] developed a system that considers underlying force patterns derived from first- and second-order dynamics in the input data to classify human actions. They employ a three-layer neural network architecture, which consists of hierarchical self-organizing maps (SOM) and a supervised neural network classifier. The SOM in the first layer reduces the dimensionality of the input data, and activity patterns in input sequences are extracted to represent posture frames. The second layer also consists of an SOM, which receives superimposed activity features from the first layer. Because of this, the temporal invariance of the system is ensured. Finally, the clusters in the second-layer SOM are labelled using a supervised neural network in the last layer.
Another approach to HAR that is worth discussing is the discrimination of actions by observing the coordinates of the joints in a three-dimensional (3D) human skeleton dataset. Over the years, numerous attempts have been made to overcome the challenges of providing an efficient and effective approach to recognizing human activity that leverages these 3D datasets [61-72]. Hand-crafted feature techniques entail significant amounts of human intervention in extracting valuable features from skeleton sequences. Moreover, the extraction of localized spatial and temporal information from processed raw skeletal joints for the formulation of DL methods is particularly difficult. In fact, spatio-temporal representations of skeletal sequences, such as those used in DL approaches, are not capable of substantially preserving local and global joint information and often suffer from view dependence, absence of motion, and insufficiency of spatial and temporal information [72,73,74,75].
In order to address this, [76] proposed a novel method that maps the 3D skeleton joint coordinates into a spatio-temporal image format (STIF). This, in turn, reduces system complexity and provides more discriminative features. A system with four main modules, namely spatio-temporal image formation, transfer learning, fusion, and classification, was proposed. Here, skeleton joints are converted into STIF, which includes spatial and temporal changes for three planes of view. The images are then fed into a backbone model comprising three pre-trained networks, namely MobileNetV2, DenseNet121, and ResNet18, each connected to a fully connected layer to extract highly discriminative features. The features extracted from the three planes of view are then fused in three different ways. Finally, the fused features are fed into two subsequent fully connected layers to reduce dimensionality before the action is categorized by a softmax classifier.
There is limited research that uses the UCF-ARG aerial dataset [110,111] because of the following challenges that it comes with:
• An aerial camera mounted onto a payload platform of a 13' Kingfisher™ Aerostat helium balloon,
• Small size of people in human patches for object detection,
• Varying activities, such as raising hands, walking, and bending bodies, performed by people in the human patches.
Human detection and activity recognition (HDAR) using the highly challenging UCF-ARG aerial dataset has been done using various methods [8,77,83,84,85]. The combination of "The Fastest Pedestrian Detector in the West" (FPDW) [114] and moving object detection was utilized for human detection and tracking in UAV-based videos [83]. Another approach to HAR based on aerial captured video sequences that comprises two phases, namely an offline phase and an inference phase, along with scene stabilization, has also been proposed [8,77]. The initial phase uses an AlexNet CNN to create a model that classifies between human and non-human and another that classifies human activity [77]. The latter phase detects humans and identifies their actions based on the models created in the prior phase. Here, HAR is carried out for each frame of the video and for entire sequences of video frames [77]. Because the regions that contained humans were small and the backgrounds contained other objects such as cars, trees, and boxes, the classification method performed poorly, with an accuracy of 68%. Recognition of human activities in UAV-based videos from motion features has been explored by using a bag-of-features approach. Here, visual words were utilized to represent motion features, which were described as a frequency count of the words. An SVM classifier served as the activity detector [84]. Lastly, an automated UAV-based DL algorithm consisted of video stabilization using SURF feature selection and the Lucas-Kanade method, human area detection using faster R-CNN, and action recognition using a structure combining a three-dimensional CNN architecture and a residual network [85]. To address limitations encountered by methods described in the literature, we propose the use of EfficientDet-D7, which was the top state-of-the-art detector for human detection, to improve detection accuracy and, thus, classification accuracy. This paper has several contributions.
Specifically:
• It makes use of a novel HDAR system to detect humans and recognize their activities from aerial captured video sequences.
• To the best of our knowledge, this is the first paper that uses EfficientDet-D7, a state-of-the-art object detector, for human detection in videos with dynamical events such as varied altitudes, illumination changes, camera jitter, and variations in viewpoints, object sizes, and colors captured from a moving camera attached to an aerial platform.
• It compares and evaluates the performance of three human detectors after adding special distortions to the video frames, such as blur, noise, and illumination changes.
• It includes a comparison between various human detectors, such as YOLOv4 [88], faster R-CNN [101], and EfficientDet [41], in terms of detection accuracy.
• It includes a comparison between various CNN-based feature extractors in terms of accuracy, precision, recall, F1 score, false negative rate (FNR), false positive rate (FPR), and Area Under Curve (AUC).
• It makes use of the highly challenging UCF-ARG dataset for methods evaluation and comparison.
• It explores the concept of cross-domain learning to transfer parameters learned from object detectors and the detection features extracted from the COCO dataset [107] to aerial captured video sequences.
• It explores the concept of cross-domain learning to transfer the parameters learned from pre-trained CNNs and the recognition features extracted from the ImageNet dataset [108] to aerial captured video sequences.
This paper is organized as follows: Section 2 describes the publicly available COCO dataset [109] used for object detection, the publicly available ImageNet dataset [108] used for feature extraction, and the highly challenging UCF-ARG dataset used for human activity recognition. Additionally, state-of-the-art object detection methods such as YOLOv4 [88], faster R-CNN [101], and EfficientDet [41] are explored for the purpose of human detection. Furthermore, the use of various pre-trained CNNs such as ResNet50, EfficientNetB0, EfficientNetB4, and EfficientNetB7 for transfer learning and spatial feature extraction is demonstrated. Finally, the use of LSTM for capturing temporal features is explained in detail. In Section 3, the experimental setup and results are discussed to compare various human detectors and CNN-based feature extractors. Section 4 summarizes the outcome, significance, and plans for future improvements of this work.

II. MATERIALS AND METHODS
This section describes the video datasets utilized in this research work. Furthermore, it discusses various object detection models. Additionally, various convolutional neural networks are demonstrated. Finally, this section explores the recurrent neural network model for the ultimate objective of human activity classification.

A. DATASET OVERVIEW
In this paper, the UCF-ARG dataset [110,111] was used. It was acquired by the University of Central Florida (UCF) using three cameras: an Aerial camera, a Rooftop camera, and a Ground camera (ARG). Here, we focused only on the most challenging subset, which contains videos captured by a high-definition aerial camera mounted on the payload platform of a helium balloon, with a resolution of 1920 × 1080 pixels at 60 fps. The challenges in this dataset are summarized as follows: 1) the frames vary in terms of viewpoints, color of clothing, positions, orientations, and human sizes; 2) the camera altitude varies as the airborne platform moves; 3) the dataset contains ten human activities performed by twelve different individuals and captured from several views.
The environments where the aerial videos were captured include three car parks in various locations. Ten activities, namely digging, throwing, waving, walking, running, clapping, boxing, jogging, carrying, and opening/closing a car trunk, were performed four times by each individual. Therefore, 48 videos were recorded for each activity. This paper used five of these activities, namely throwing, waving, digging, walking, and running. Three of these activities were static, i.e., performed in place (waving, digging, and throwing), while two were dynamic (walking and running). These five activities were selected to have a fair comparison with other works that used the same five activities for human activity classification [77,84,85]. The performance of the human detection models and the human activity classification models was evaluated using these five activities. Figure 1 illustrates various human samples detected and cropped from video frames. The human images are of various sizes but were uniformly resized for visualization. The human image patches in Figure 1 have various backgrounds, colors of clothing, human sizes, activities, and viewpoints.
The number of frames per video varied from fewer than ten to a few hundred. Only ten frames were selected from each video, yielding a total of 240 × 10 = 2400 frames for evaluation and comparison between various human detection models. All video frames were considered for the human activity classification task.
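The text does not say how the ten frames per video were chosen. As a minimal sketch, assuming evenly spaced sampling (an assumption, not the authors' stated procedure), the selection could look like this; the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(num_frames, num_samples=10):
    """Return up to `num_samples` evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0:
        return []
    if num_frames <= num_samples:
        # Short clip: keep every available frame.
        return list(range(num_frames))
    # Spread the samples from the first frame to the last frame.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Example: a clip with 300 frames yields ten indices from 0 to 299.
indices = sample_frame_indices(300)

# 240 videos with ten frames each give the 2400 evaluation frames.
total_frames = 240 * 10
```

Even spacing is only one reasonable choice; random sampling or taking the first ten frames would also be consistent with the description.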

B. METHODS
This section discusses various object detection models such as YOLOv4, faster R-CNN, and EfficientDet used for human detection in videos. Additionally, various CNNs such as ResNet and EfficientNet used for feature extraction from video frames are demonstrated. Finally, this section explores the recurrent neural network model called long short-term memory (LSTM) to record the history of extracted features for the ultimate objective of human activity classification. Figure 2 shows the block diagram of the proposed HAR system.

1) HUMAN DETECTION
Various object detectors such as YOLOv4, faster R-CNN, and EfficientDetD7 pre-trained on the COCO dataset with 91 categories of objects were used in this paper for the human detection task. We used these human detectors with learned parameters without finetuning them on the UCF-ARG dataset. In other words, one of our objectives was to evaluate and compare the performance of these object detectors for human detection in challenging aerial videos.

a) YOU ONLY LOOK ONCE (YOLO)
YOLO is a real-time detection model that offers a good balance between high inference speed and good accuracy [86,87,88]. Training in YOLO is performed by applying full images to an end-to-end neural network that uses convolutional layers for feature extraction and fully connected layers for classification and bounding box prediction. The advantage of YOLO is its ability to see the full images during the training stage, which results in a remarkable reduction in the number of background errors when compared with fast R-CNN. However, YOLO has more localization errors. Additionally, YOLO includes 24 convolutional layers added before two fully connected layers. The architecture of YOLO was inspired by the GoogleNet CNN used for image classification. For pre-training on the ImageNet dataset, the convolutional layers take input images of 224 × 224 pixels; for detection, YOLO doubles the input resolution [86,87]. YOLOv4 is a more recent, faster, and more accurate object detector [88] with many improvements in its architecture and training strategy. It was designed to run in real time and to be trained using only one GPU. YOLOv3 has been demonstrated in various applications and was able to yield high performance in detecting nude humans in pornographic videos [112]. However, it was found that YOLOv4 outperformed YOLOv3 in terms of detection accuracy [88]. YOLOv4 consists of the following components:
1. Backbone module: CSPDarknet53 [89] for feature extraction,
2. Neck module: Spatial Attention Module (SAM) [90], Spatial Pyramid Pooling (SPP) [91], and Path Aggregation Network (PAN) [92] to enhance the receptive field presented by the Backbone,
3. Head module: YOLOv3 [93] to predict the final output, which consists of the bounding boxes and the classification scores for each object.
The main improvements in YOLOv4 are in the Neck module and in the training strategy. The Neck module consists of three submodules, namely SPP, modified PAN, and modified SAM. The SPP submodule was added over CSPDarknet53 to increase the receptive field; the modified PAN submodule was added as a method of parameter aggregation from different CSPDarknet53 backbone levels; and the modified SAM submodule is an attention mechanism applied over the feature maps.
In this paper, we demonstrated YOLOv4 with the CSPDarknet53 network as a backbone for human detection in aerial videos. The video frames, with a size of 540 × 960 pixels, were applied directly to YOLO. YOLOv4 was selected to balance the tradeoff between detection accuracy and speed. YOLOv4 usually runs twice as fast as EfficientDet with comparable performance. In this work, YOLOv4 was tuned to filter and detect only humans and ignore other classes. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

b) FASTER REGION BASED CONVOLUTIONAL NEURAL NETWORK (R-CNN)
Region-based CNNs are computationally expensive. Faster R-CNN was found to improve on both the detection accuracy and the run time of fast R-CNN. Furthermore, faster R-CNN also outperformed YOLOv3 in terms of detection accuracy [88,93]. A faster R-CNN [101] object detector consists of two modules. The first module is a region proposal network (RPN), a fully convolutional network that uses an attention mechanism [102] in order to enhance the feature maps. The RPN takes an image of arbitrary size and proposes regions by producing a set of rectangles, each with its objectness score. The feature maps in the RPN are shared with the detection network of a fast R-CNN [103], which is the second module and utilizes the proposed regions. To generate region proposals, a small network that takes an n × n spatial window as input is slid over the last shared convolutional feature map and maps each window to a lower-dimensional feature (256-d for ZF). These features are then fed into two fully connected layers: a box regression layer (reg) and a box classification layer (cls). At each location of the sliding window, k region proposals (boxes) are predicted. The classification layer has 2k scores (object or not object for each box), and the regression layer has 4k outputs (four coordinates for each box) [101]. Figure 3 illustrates the region proposal network (RPN). In this paper, we demonstrated a faster R-CNN with ResNet instead of ZF and VGG for human detection in aerial videos. The video frames, with a size of 540 × 960 pixels, were applied to the detector. The faster R-CNN was tuned to filter and detect only persons and ignore other classes.
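Restricting a COCO-trained detector to humans, as done for both YOLOv4 and faster R-CNN, amounts to discarding every detection whose class is not "person". The following is a minimal sketch of that filtering step under stated assumptions: the `Detection` record, the `keep_persons` helper, the class index 1 for "person", and the 0.5 score threshold are all hypothetical, since the index depends on the label map of the model and the paper does not report a threshold.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    class_id: int                    # index into the detector's label map
    score: float                     # confidence in [0, 1]
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def keep_persons(detections: List[Detection], person_class_id: int,
                 min_score: float = 0.5) -> List[Detection]:
    """Keep only confident detections whose class is 'person'."""
    return [d for d in detections
            if d.class_id == person_class_id and d.score >= min_score]


# Made-up detections for one 540 x 960 frame (values are illustrative only).
raw = [
    Detection(class_id=1, score=0.91, box=(400, 250, 430, 310)),  # person
    Detection(class_id=3, score=0.88, box=(100, 300, 260, 380)),  # another class
    Detection(class_id=1, score=0.20, box=(700, 100, 715, 130)),  # weak person
]
persons = keep_persons(raw, person_class_id=1)
print(len(persons))  # 1
```

The surviving boxes are the regions that would be cropped and passed on to the feature extractor.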

c) EfficientDet
EfficientDet is a state-of-the-art object detection method which can yield higher accuracy with much fewer parameters and FLOPs than prior methods [41].
EfficientDet proposed a bi-directional feature pyramid network (BiFPN) used for multi-scale feature fusion. It also has a family of various architectures (D0 … D7). Additionally, EfficientDet proposed a compound scaling method that uniformly scales the resolution, depth, and width for all backbones, feature networks, and box/class prediction networks simultaneously. Figure 4 shows the BiFPN feature aggregation method. In EfficientDet, BiFPN takes features at levels {P3, P4, P5, P6, P7} from the EfficientNet [104] backbone. Furthermore, BiFPN is applied repeatedly L times, where L is related to the EfficientDet version. Finally, {P3, P4, P5, P6, P7} are fed into the class/box network. Figure 5 shows the general architecture of the EfficientDet detector.
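For reference, the compound scaling rules given in the EfficientDet paper [41] can be written as below. That these are the equations the following sentence refers to is an assumption. Here W_bifpn and D_bifpn are the width (channels) and depth (number of layers) of the BiFPN, D_box and D_class are the depths of the box and class prediction networks, and R_input is the input resolution.

```latex
\begin{align}
W_{\mathrm{bifpn}} &= 64 \cdot 1.35^{\phi}, & D_{\mathrm{bifpn}} &= 3 + \phi, \\
D_{\mathrm{box}} &= D_{\mathrm{class}} = 3 + \lfloor \phi / 3 \rfloor, \\
R_{\mathrm{input}} &= 512 + 128\,\phi .
\end{align}
```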
Here, ϕ is the compound scaling coefficient, which is related to the EfficientDet version (Table I).

Ten frames were selected from each of the 240 videos to have 2400 original frames in total. The 2400 selected frames were modified as follows: 1) flipping frames horizontally, 2) blurring frames, 3) adding Gaussian noise to frames, 4) lightening the frames, 5) darkening the frames, and 6) converting from the RGB color space to the grayscale color space. The evaluation and comparison between YOLOv4, faster R-CNN, and EfficientDetD7 was also done on all modified frames. The performance is measured using the detection accuracy as follows:
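Assuming, in line with the experimental setup, that a frame counts as correct when the detector finds the human in it, the detection accuracy can be written as the fraction of evaluated frames with a detection. The symbols N_detected and N_total are introduced here for illustration only; N_total is 2400 for each set of frames.

```latex
\begin{equation}
\text{Detection accuracy} = \frac{N_{\text{detected}}}{N_{\text{total}}} \times 100\%
\end{equation}
```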

2) PRE-TRAINED CNNS FOR FEATURE EXTRACTION
After the human detection model, a set of human patches or ROIs that were cropped from the video frames were applied to CNN-based feature extraction to extract a sequence of features from a sequence of frame ROIs. The sizes of image patches are not equal because altitudes are varied, hence varying the distances between the aerial camera and the human individuals. Therefore, the patches were resized before being applied to pre-trained CNNs to extract the spatial features.
In this paper, the technique of transfer learning was demonstrated to transfer representations from the ImageNet domain to the aerial imagery domain. The parameters of the CNN pre-trained on the ImageNet 1K dataset were utilized for feature extraction without further finetuning of parameters in the first layers. The top layers of pre-trained CNNs were removed and were replaced by LSTM [105], which was tuned with a small-scale UCF-ARG dataset. Various recent architectures of CNNs such as ResNet [106] and EfficientNet [104], including EfficientNetB0, EfficientNetB4, and EfficientNetB7, were pre-trained with natural images of ImageNet to learn high-level (objects and shapes) and low-level (textures, edges, and colors) representations from aerial video frames. The experiments were carried out to compare between the previously mentioned pre-trained CNNs. The image patches or ROIs that were cropped from the EfficientDet human detection model were resized to 224 × 224 in ResNet50, EfficientNetB0, and EfficientNetB4. On the other hand, they were resized to 600 × 600 in EfficientNetB7. The features extracted by each pre-trained CNN have the following dimensions: 2048 in ResNet50, 1280 in EfficientNetB0, 1792 in EfficientNetB4, and 2560 in EfficientNetB7.

a) ResNet
The residual learning framework, also called ResNet, is a very deep network that yields very good generalization without overfitting [106]. In ResNet, different numbers of layers, e.g., 50, 101, and 152, may be used. In the training stage, a ResNet CNN is trained in a supervised manner on a large-scale labelled dataset such as ImageNet. The ResNet layers are reformulated as learning residual functions with reference to the layer inputs.
In this paper, the ROIs of humans cropped from aerial video frames were resized to 224 × 224 pixels and were applied to ResNet50 to extract 2048 features. The top layers of ResNet50 were removed. The LSTM was then added to utilize the sequence of features extracted from the sequence of ROIs cropped from the sequence of video frames.

b) EfficientNet
Model scaling is usually done by increasing network depth or network width, or by increasing the resolution of the input images used for training and evaluation. Although accuracy is improved through model scaling methods, the drawback is that it entails more manual tuning. By balancing the depth, width, and resolution of a network, EfficientNet was found to speed up inference and to outperform existing state-of-the-art CNNs in accuracy on ImageNet [104]. To improve accuracy, various architectures of EfficientNet are available, including B0 as the baseline network and B1, B2, B3, B4, B5, B6, and B7 as scaled networks. However, the cost of the accuracy improvement is more FLOPs.
Citation information: DOI 10.1109/ACCESS.2022.3182315
In this paper, the ROIs of humans cropped from aerial video frames were resized to 224 × 224 pixels in EfficientNetB0 and EfficientNetB4, and to 600 × 600 pixels in EfficientNetB7. Furthermore, they were applied to EfficientNetB0, EfficientNetB4, and EfficientNetB7 to extract 1280, 1792, and 2560 features, respectively. The top layers of EfficientNet were removed. The LSTM was then added to utilize the sequence of features extracted from the sequence of ROIs cropped from the sequence of video frames.

c) Long Short-Term Memory for Time Series Classification
A Recurrent Neural Network (RNN) was utilized for sequence modelling to capture temporal correlations [105]. LSTM is a special type of RNN that has been found to mitigate the vanishing gradient problem. LSTM has a memory cell that accumulates state information, supported by control gates, for long-range sequence modelling, as shown in Figure 6. In this work, a sequence of features extracted from the ROIs of frames was applied to the LSTM, which was used to replace the top layers of the pre-trained CNNs. Additionally, the LSTM was trained, and its parameters were fine-tuned iteratively to fit the features extracted from ROIs cropped from UCF-ARG video frames.
Several experiments were conducted to select the optimal LSTM structure and architecture that can produce the best performance metrics. The LSTM architecture includes activation functions, the number of LSTM layers, the number of nodes in each layer, and the number of fully connected layers. In this work, the optimal LSTM architecture consists of the following layers:
1) GlobalMaxPool2D
2) Bidirectional_LSTM with 512 nodes, tanh activation and sigmoid recurrent activation
3) Fully connected layers with 256 nodes and ReLU activation function
4) Fully connected layers with five nodes
5) Softmax activation function
Additionally, several experiments were conducted to select the optimal LSTM hyperparameters that can produce the best performance metrics. The hyperparameters include the number of epochs, the optimizer type, the loss function, the learning rate, and the batch size. In this work, the optimal LSTM hyperparameters are as follows:
1) The learning rate used to train the LSTM model was set to 0.001.
2) The batch size was set to 32.
3) The number of epochs was set to 50.
4) The loss function was Categorical_Crossentropy.
5) The optimizer was Adam.
In summary, an LSTM architecture with the hyperparameters described earlier was used for the classification of a time series consisting of a sequence of features extracted from ROIs cropped from UCF-ARG video frames. The video class or activity class is based on the history of the extracted features. The output of the LSTM is one of five human activities, namely digging, throwing, waving, walking, and running.
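The layer list and hyperparameters above can be sketched in Keras as follows. This is one possible reading, not the authors' code: it assumes that the input is a sequence of un-pooled CNN feature maps, that GlobalMaxPool2D is applied per frame through `TimeDistributed`, and that there is a single dense layer of each size. The function name `build_lstm_head` and the small tensor shapes in the example are hypothetical.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models


def build_lstm_head(seq_len, fmap_h, fmap_w, channels, num_classes=5):
    """Classifier over a sequence of per-frame CNN feature maps."""
    inputs = layers.Input(shape=(seq_len, fmap_h, fmap_w, channels))
    # 1) Global max pooling applied to the feature map of every frame.
    x = layers.TimeDistributed(layers.GlobalMaxPool2D())(inputs)
    # 2) Bidirectional LSTM with 512 units per direction.
    x = layers.Bidirectional(
        layers.LSTM(512, activation="tanh", recurrent_activation="sigmoid"))(x)
    # 3) Fully connected layer with 256 nodes and ReLU.
    x = layers.Dense(256, activation="relu")(x)
    # 4) and 5) Five output nodes followed by softmax.
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Tiny illustrative shapes: 8 frames, each a 7 x 7 x 64 feature map.
model = build_lstm_head(seq_len=8, fmap_h=7, fmap_w=7, channels=64)
dummy = np.zeros((2, 8, 7, 7, 64), dtype="float32")
probs = model.predict(dummy, verbose=0)  # shape (2, 5), rows sum to 1
```

Training with the reported settings would then be a call such as `model.fit(x_train, y_train, batch_size=32, epochs=50)`, where `x_train` and `y_train` stand for the extracted feature sequences and one-hot activity labels.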
Several experiments were conducted to compare various CNN-based feature extraction approaches. The comparison was done by considering all frames in the 240 videos. The performance is measured using the following performance metrics:
1. Accuracy calculates the number of correctly predicted videos over all videos.
2. Recall (Sensitivity) calculates the number of correctly predicted positive videos over all actual positive videos.
3. Precision calculates the number of correctly predicted positive videos over all predicted positive videos.
4. F1 score summarizes recall and precision into one quantity.
5. False Negative Rate (FNR) calculates the number of positive videos incorrectly predicted as negative over all actual positive videos.
6. False Positive Rate (FPR) calculates the number of negative videos incorrectly predicted as positive over all actual negative videos.
7. Area Under ROC Curve (AUC) determines the ability of a classifier to distinguish between classes.
In Equations 2 through 7, TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.
The higher the accuracy, recall, precision, F1 score, and AUC, the better the model performance. Conversely, the lower the FNR and FPR, the better the model performance.
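For reference, the metrics above can be written in terms of TP, TN, FP, and FN using their standard definitions. These expressions are assumed to match the paper's Equations 2 through 7, which are not reproduced in this text:

```latex
\[
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Recall}    &= \frac{TP}{TP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, &
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \\
\text{FNR}       &= \frac{FN}{TP + FN} = 1 - \text{Recall}, &
\text{FPR}       &= \frac{FP}{FP + TN}.
\end{aligned}
\]
```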

A. EXPERIMENTAL SETUP
The experiments for human detection and human activity classification were conducted using Python with TensorFlow and OpenCV on Google Colaboratory using an NVIDIA Tesla K80 GPU with 12 GB of memory.
In this work, 48 videos for each of five activities were used, making 240 videos in total. In the human detection task, 2400 frames were selected from 240 videos. The evaluation and comparison between human detection models including YOLOv4, faster R-CNN, and EfficientDet were done to find the best detector that was able to detect humans in the largest number of frames out of 2400 frames. The best detector should also be robust against various effects applied to video frames such as horizontal flipping, blurring, addition of Gaussian noise, lightening, darkening, and conversion from RGB to grayscale.
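The frame effects used for the robustness evaluation can be sketched as below. This is a minimal illustration, not the authors' implementation: it uses NumPy only (rather than OpenCV) to stay self-contained, a simple mean filter stands in for whichever blur was actually used, and the noise level, brightness offset, and kernel size are hypothetical values.

```python
import numpy as np


def flip_horizontal(frame):
    """Mirror an H x W x 3 frame left to right."""
    return frame[:, ::-1]


def add_gaussian_noise(frame, sigma=25.0, seed=0):
    """Add zero-mean Gaussian noise (sigma is a hypothetical value)."""
    rng = np.random.default_rng(seed)
    noisy = frame.astype(np.float32) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def change_brightness(frame, delta):
    """Lighten (delta > 0) or darken (delta < 0) a frame."""
    shifted = frame.astype(np.int16) + delta
    return np.clip(shifted, 0, 255).astype(np.uint8)


def to_grayscale(frame):
    """Convert RGB to grayscale (BT.601 weights), replicated to 3 channels."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = np.clip((frame.astype(np.float32) @ weights).round(), 0, 255)
    return np.repeat(gray.astype(np.uint8)[..., None], 3, axis=-1)


def box_blur(frame, k=5):
    """Blur with a k x k mean filter (k odd), using edge padding."""
    pad = k // 2
    padded = np.pad(frame.astype(np.float32),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(frame.shape, dtype=np.float32)
    h, w = frame.shape[:2]
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return (out / (k * k)).round().astype(np.uint8)
```

Each function takes and returns a `uint8` frame of the same shape, so the same detector can be run on the original and on every modified set.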
In the human activity classification task, all patches or ROIs cropped from all video frames were selected from each of 240 videos. The videos were divided into training data and testing data. We followed the same protocol used in state-of-the-art methods that utilized the same UCF-ARG dataset. The protocol states that one of twelve persons was used for testing and all other eleven persons were used for training and validation. In other words, in each of 12 experiments, the videos were divided into 220 for training and 20 for testing. The evaluation and comparison between pre-trained CNN-based feature extractors were done on ResNet50, EfficientNetB0, EfficientNetB4, and EfficientNetB7. The objective is to find the best pre-trained CNN that would be able to produce the best performance metrics in terms of accuracy, recall, precision, F1 score, False Negative Rate (FNR), False Positive Rate (FPR), and Area Under Curve (AUC).
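The leave-one-person-out protocol described above can be sketched as follows. The video index is hypothetical: it assumes four videos per person per activity (48 videos per activity divided by 12 persons), which reproduces the stated split of 220 training and 20 testing videos per experiment.

```python
from itertools import product

ACTIVITIES = ["digging", "waving", "throwing", "walking", "running"]
NUM_PERSONS = 12
REPEATS = 4  # assumed: 48 videos per activity / 12 persons


def build_index():
    """Build a hypothetical index of the 240 videos."""
    return [
        {"person": person, "activity": activity, "repeat": repeat}
        for person, activity, repeat in product(
            range(1, NUM_PERSONS + 1), ACTIVITIES, range(REPEATS)
        )
    ]


def leave_one_person_out(videos):
    """Yield (held_out_person, train_videos, test_videos) for each person."""
    persons = sorted({video["person"] for video in videos})
    for held_out in persons:
        train = [v for v in videos if v["person"] != held_out]
        test = [v for v in videos if v["person"] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    for person, train, test in leave_one_person_out(build_index()):
        print(person, len(train), len(test))
```

Splitting by person rather than by video keeps every clip of the held-out person out of training, so the reported metrics reflect generalization to an unseen subject.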

1) HUMAN DETECTION EXPERIMENTS
The first set of experiments was conducted to compare various human detectors, including YOLOv4, faster R-CNN, EfficientDetD0, EfficientDetD4, and EfficientDetD7, using video frames. In these experiments, 2400 frames were selected from each set, namely the original frames and the sets produced by flipping, blurring, addition of Gaussian noise, darkening, whitening, and conversion to grayscale.
First, a comparison was done using frames of the original set that contains five activities: digging, waving, throwing, walking, and running. Table II shows the detection accuracy of each activity and the average of all five activities. EfficientDetD7 was found to outperform other human detectors with an average accuracy of 97.1%. On the other hand, EfficientDetD0 produced the lowest average accuracy of 81.7%. It is obvious that EfficientDetD7 outperformed others in detecting the following activities: digging, throwing, and running. However, YOLOv4 yielded better accuracies for waving and walking.
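One plausible way to compute a frame-level detection accuracy such as the values reported in the tables is sketched below. It assumes that a frame counts as correct when at least one detection labelled as a person exceeds a score threshold; the exact criterion used in the paper is not stated, and the detection format, function names, and threshold are hypothetical.

```python
def frame_has_person(detections, score_threshold=0.5):
    """Return True if any detection is a confident 'person' prediction."""
    return any(
        det["label"] == "person" and det["score"] >= score_threshold
        for det in detections
    )


def detection_accuracy(frames_detections, score_threshold=0.5):
    """Percentage of frames in which a person was detected.

    frames_detections: one list of detections per frame, where each
    detection is a dict with 'label' and 'score' keys (assumed format).
    """
    if not frames_detections:
        raise ValueError("at least one frame is required")
    hits = sum(
        frame_has_person(dets, score_threshold) for dets in frames_detections
    )
    return 100.0 * hits / len(frames_detections)


if __name__ == "__main__":
    frames = [
        [{"label": "person", "score": 0.91}],
        [{"label": "car", "score": 0.88}],
        [{"label": "person", "score": 0.30}],
        [{"label": "person", "score": 0.75}, {"label": "car", "score": 0.60}],
    ]
    print(detection_accuracy(frames))
```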

a) Ablation Study
Peng et al. [85] used faster R-CNN to detect and localize humans in video frames. In our work, we replaced faster R-CNN with EfficientDet and compared the detection accuracy of both models to evaluate their performance when various challenges are present in the frames. We found that our proposed human detector improved the detection accuracy significantly, specifically when there is noise, blur, or illumination change in the frames. Tables II through VIII show the comparisons between faster R-CNN and EfficientDet.
Second, a comparison was done using frames that contained the five activities with Gaussian noise added. Table III shows the detection accuracy of each activity and the average of all five activities. EfficientDetD4 was found to outperform other human detectors with an average accuracy of 93.8%, and it outperformed the others in all activities. Additionally, EfficientDetD7 produced the second highest average accuracy of 92%. On the other hand, other detectors including YOLOv4 and faster R-CNN yielded average accuracies that were more than 10% lower than that of EfficientDetD4. Furthermore, the performance of EfficientDetD0 was the worst at only 34%. In summary, the results indicate that EfficientDetD0 was not robust against the Gaussian noise added to the frames. Likewise, the accuracies of YOLOv4 and faster R-CNN were degraded significantly compared with those obtained without noise. In contrast, EfficientDetD4 and EfficientDetD7 were highly robust against the addition of Gaussian noise to the video frames.
Third, a comparison was done using frames that contained the five activities after having been blurred. Table IV shows the detection accuracy of each activity and the average of all five activities. Although EfficientDetD7 showed degradation in accuracy, it yielded superior performance compared with other human detectors with an average accuracy of 76.3%. Additionally, EfficientDetD7 outperformed other detectors in all five activities. However, other architectures of EfficientDet such as D0 and D4 produced lower accuracies. Furthermore, other detectors including YOLOv4 and faster R-CNN yielded accuracies that were more than 20% lower than that of EfficientDetD7. In summary, the results indicate that blurring negatively impacts human object detectors in general. However, EfficientDetD7 was still the most robust human detector despite blurring.
Fourth, a comparison was done using frames that contained the five activities after being flipped horizontally. Table V shows the detection accuracy of each activity and the average of all five activities. EfficientDetD7 yielded the best performance with an average accuracy of 97.3%. Although EfficientDetD7 outperformed other detectors at detecting throwing, walking, and running, EfficientDetD4 was better at detecting digging and waving. On the other hand, EfficientDetD0 produced the lowest average accuracy of 82.7%.

TABLE V A COMPARISON BETWEEN VARIOUS HUMAN DETECTORS USING VIDEO FRAMES FLIPPED HORIZONTALLY
Fifth, to study the performance of human object detectors in the grayscale color space, a comparison was done using video frames that contained five activities after being converted from RGB to grayscale. Table VI presents the detection accuracy for each activity and the average of all five activities. EfficientDetD4 outperformed other detectors in all activities except walking and produced the best average accuracy of 96.6%. Additionally, EfficientDetD7 yielded the second-best average accuracy. On the other hand, EfficientDetD0 yielded the worst performance with an average accuracy of only 61.5%.
Sixth, to study the performance of human object detectors under various illumination (light) conditions, a comparison was done after darkening and whitening the video frames that contained five activities. Table VIII describes the detection accuracy of each activity and the average of all five activities after whitening the video frames. EfficientDetD4 outperformed other detectors for throwing and digging and produced the best average accuracy of 96.5%. Additionally, EfficientDetD7 outperformed other detectors for running and yielded the second-best average accuracy. Furthermore, YOLOv4 outperformed other detectors for waving and walking. On the other hand, EfficientDetD0 yielded the worst performance with an average accuracy of 79.4%.

TABLE VIII A COMPARISON BETWEEN VARIOUS HUMAN DETECTORS AFTER WHITENING THE VIDEO FRAMES
The three architectures of EfficientDet, namely D0, D4, and D7, were compared by plotting the histograms of the confidence scores produced by each detector for 2400 frames. A confidence score reports the predicted probability of the human category. Figure 7 illustrates the histogram comparison for original video frames. EfficientDetD7 (yellow) yielded higher scores than EfficientDetD4 (green). On the other hand, EfficientDetD0 (purple) yielded the lowest scores.
Figure 8 compares the histograms of confidence scores after blurring or adding Gaussian noise to video frames. EfficientDetD7 (yellow) again yielded higher scores than EfficientDetD4 (green), and EfficientDetD0 (purple) yielded the lowest scores. The confidence scores for the original video frames are higher than those for frames with Gaussian noise. Similarly, blurring video frames leads to a reduction in confidence scores for all human detectors. However, EfficientDetD7 was still the most robust detector despite blurring, as confirmed by the histograms of confidence scores shown in Figure 8.
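The histogram comparison can be reproduced in outline by binning each detector's confidence scores over the interval [0, 1]. The sketch below is illustrative only: the function name, the number of bins, and the sample scores are hypothetical.

```python
def score_histogram(scores, num_bins=10):
    """Count confidence scores in equal-width bins over [0, 1]."""
    counts = [0] * num_bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical per-frame scores for two detectors.
    detector_a = [0.95, 0.91, 0.88, 0.97]
    detector_b = [0.55, 0.62, 0.48, 0.91]
    print(score_histogram(detector_a))
    print(score_histogram(detector_b))
```

A detector whose counts concentrate in the rightmost bins assigns higher confidence to its human detections, which is the pattern described for EfficientDetD7.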
In summary, it is obvious based on Tables II through VIII that EfficientDetD0 yielded the worst performance with the lowest accuracies in all scenarios. This may be because of the variety of activities performed by humans, and their small sizes in aerial videos. On the other hand, the performances of the YOLOv4 and Faster R-CNN human detectors were good in all scenarios except in those with blurring and Gaussian noise.
Tables II through VIII also show that EfficientDetD7 outperformed other human detectors in many scenarios, including the original frames and those modified by flipping horizontally, by blurring, and by darkening. Similarly, EfficientDetD7 yielded the second-best performance in the remaining scenarios, namely those modified by adding Gaussian noise, by whitening, and by converting to grayscale. The comparison between various human detectors was done by calculating an average accuracy for each human detector over all activities and all scenarios, as shown in Table IX. As a result, we deduced that EfficientDetD7 would be a good human detector for aerial captured video sequences. The power of EfficientDetD7 results from its robustness against variations in human size, clothing color, viewpoint, and position. Moreover, it can detect humans even in the presence of various factors affecting the video frames such as noise, blur, illumination change, and grayscale conversion. Therefore, we utilized EfficientDetD7 in the human activity classification experiments to detect and crop ROIs of humans from frames. The cropped patches or ROIs were then applied to CNN-based feature extraction models.
Figures 9, 10, and 11 show three frames with blurring for five human detectors: faster R-CNN, YOLOv4, EfficientDetD0, EfficientDetD4, and EfficientDetD7. While EfficientDetD7 was able to detect humans in the three frames, EfficientDetD4 was able to detect only two humans in two frames and misclassified one human in one frame. On the other hand, faster R-CNN, YOLOv4, and EfficientDetD0 were not able to detect humans in any of the three frames. They detected only objects that were irrelevant in the background, such as cars.
Figures 12, 13, and 14 show three frames with three scenarios, namely darkening, converting to grayscale, and whitening, respectively, for five human detectors: faster R-CNN, YOLOv4, EfficientDetD0, EfficientDetD4, and EfficientDetD7. While EfficientDetD7 was able to detect humans in the three frames in the three scenarios, EfficientDetD4 was able to detect only two humans in two frames that were converted to grayscale, and in two frames that were whitened. EfficientDetD0 was able to detect only one human in the whitened frames. On the other hand, faster R-CNN and YOLOv4 failed to detect humans in the three frames and detected only objects that were irrelevant in the background.
Figure 15 shows one frame with multiple human individuals detected using five human detectors: faster R-CNN, YOLOv4, EfficientDetD0, EfficientDetD4, and EfficientDetD7. EfficientDetD7 was able to detect all seven humans in the frame. Furthermore, EfficientDetD4, faster R-CNN, and YOLOv4 were able to detect all humans except one standing behind the white car. On the other hand, EfficientDetD0 was not able to detect five of the humans. It detected only two humans and merged them incorrectly in one box.
Figure 16 shows one frame with multiple humans for five human detectors: faster R-CNN, YOLOv4, EfficientDetD0, EfficientDetD4, and EfficientDetD7. EfficientDetD7 was able to detect all seven humans in the frame. Furthermore, EfficientDetD4 and faster R-CNN were able to detect all humans except one standing behind the white car. Additionally, YOLOv4 was able to detect all humans in the frame but also detected irrelevant objects in the background. On the other hand, EfficientDetD0 was not able to detect six of the seven humans.
Figure 17 shows one frame with multiple humans for five human detectors. EfficientDetD7 was able to detect all eight humans in the frame. While EfficientDetD4 was able to detect seven of the eight humans, it failed to detect the human bending over to pick something up from the ground. Faster R-CNN detected the car trunk as a human, and YOLOv4 detected an object on the ground as a human. On the other hand, EfficientDetD0 failed to detect seven of the eight humans in the frame.
Figure 18 shows one frame with multiple humans for five human detectors. EfficientDetD7 and YOLOv4 were able to detect all eight humans in the frame. However, YOLOv4 also detected an object on the ground as a human. While EfficientDetD4 was able to detect seven of the eight humans, it failed to detect the person of whom only the upper body is visible. Moreover, faster R-CNN was able to detect six of the eight humans. On the other hand, EfficientDetD0 failed to detect seven of the eight humans.
Figure 19 shows one frame with multiple humans for five human detectors. EfficientDetD7 was able to detect all seven humans in the frame. On the other hand, EfficientDetD4, faster R-CNN, and YOLOv4 were able to detect six of the seven humans but failed to detect the human standing behind the white car. EfficientDetD0 failed to detect six of the seven humans.
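The step of cropping a detected human from a frame before feature extraction can be sketched as below. The box format is an assumption: coordinates are taken to be normalized `(ymin, xmin, ymax, xmax)` values in [0, 1], as commonly returned by TensorFlow object detection models. The function name and the optional margin are hypothetical.

```python
import numpy as np


def crop_roi(frame, box, margin=0.0):
    """Crop a detected region from an H x W x C frame.

    box: (ymin, xmin, ymax, xmax), normalized to [0, 1] (assumed format).
    margin: optional fractional padding around the box (hypothetical).
    """
    height, width = frame.shape[:2]
    ymin, xmin, ymax, xmax = box
    pad_y = (ymax - ymin) * margin
    pad_x = (xmax - xmin) * margin
    # Convert to pixel coordinates and clamp to the frame boundaries.
    y0 = max(int((ymin - pad_y) * height), 0)
    x0 = max(int((xmin - pad_x) * width), 0)
    y1 = min(int(np.ceil((ymax + pad_y) * height)), height)
    x1 = min(int(np.ceil((xmax + pad_x) * width)), width)
    if y1 <= y0 or x1 <= x0:
        raise ValueError("bounding box is empty after clamping")
    return frame[y0:y1, x0:x1]


if __name__ == "__main__":
    frame = np.zeros((100, 200, 3), dtype=np.uint8)
    patch = crop_roi(frame, (0.25, 0.5, 0.75, 1.0))
    print(patch.shape)
```

The cropped patch would typically be resized to the input resolution expected by the feature extractor before being passed on.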

2) HUMAN ACTIVITY CLASSIFICATION EXPERIMENTS
The second set of experiments was conducted to compare various CNN-based feature extraction models placed before the LSTM architecture for human activity classification using human ROIs cropped from the video frames. In these experiments, all original video frames were selected from each video. The number of videos for the five activities, namely digging, waving, throwing, walking, and running, is 240 (48 for each activity). Tables X through XIII show the performance metrics for each CNN, namely ResNet50, EfficientNetB0, EfficientNetB4, and EfficientNetB7. The metrics were calculated by running each model 12 times. Each time, we held out one of the 12 persons for testing and used the other 11 for training and validation. The accuracy, recall, precision, F1 score, FNR, and FPR were calculated for each person for five activities. Additionally, an average for each metric was calculated to compare the CNNs and to find the best candidate for the HAR task.
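The per-person metrics can be derived from a five-class confusion matrix by treating each activity in turn as the positive class. The sketch below assumes macro averaging over the five activities; the averaging scheme used in the paper is not stated, and the function names and the example matrix are hypothetical.

```python
def per_class_counts(cm, k):
    """Return (TP, TN, FP, FN) for class k; rows are true labels."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def macro_metrics(cm):
    """Accuracy plus macro-averaged recall, precision, F1, FNR, and FPR."""
    n = len(cm)
    sums = {"recall": 0.0, "precision": 0.0, "f1": 0.0, "fnr": 0.0, "fpr": 0.0}
    for k in range(n):
        tp, tn, fp, fn = per_class_counts(cm, k)
        recall = safe_div(tp, tp + fn)
        precision = safe_div(tp, tp + fp)
        sums["recall"] += recall
        sums["precision"] += precision
        sums["f1"] += safe_div(2 * precision * recall, precision + recall)
        sums["fnr"] += safe_div(fn, tp + fn)
        sums["fpr"] += safe_div(fp, fp + tn)
    metrics = {name: total / n for name, total in sums.items()}
    correct = sum(cm[k][k] for k in range(n))
    metrics["accuracy"] = safe_div(correct, sum(map(sum, cm)))
    return metrics


if __name__ == "__main__":
    # Hypothetical result for one held-out person (4 videos per activity).
    cm = [
        [4, 0, 0, 0, 0],
        [0, 4, 0, 0, 0],
        [1, 0, 3, 0, 0],
        [0, 0, 0, 4, 0],
        [0, 0, 0, 0, 4],
    ]
    print(macro_metrics(cm))
```

Averaging these dictionaries over the 12 held-out persons would give per-model averages of the kind summarized in the tables.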
In Table X, ResNet50 was evaluated for 12 persons with five activities. An average accuracy of 75% and an average F1 score of 74% were calculated. In Table XI, the performance of EfficientNetB0 was evaluated for 12 persons with five activities. An average accuracy of 65% and an average F1 score of 62% were calculated. On the other hand, Table XII demonstrates the performance of EfficientNetB4 for 12 persons with five activities. The average accuracy of EfficientNetB4 was 71%, which exceeds the accuracy of EfficientNetB0 by 6%. Additionally, the average F1 score of EfficientNetB4 was 68%, which exceeds the F1 score of EfficientNetB0 by 6%. Neither EfficientNetB0 nor EfficientNetB4 was able to outperform ResNet50, which was better than EfficientNetB4 by 4% in terms of accuracy and 6% in terms of F1 score. Table XIII presents the performance of EfficientNetB7, which yielded the best averages among the four CNNs, including an accuracy of 80%, a precision of 83%, a recall of 80%, an F1 score of 80%, and an FPR of 4.8%. These results indicate that EfficientNetB7 was the best of the candidate CNN-based feature extractors for the HAR task.
Figure 20 shows the average for each metric (accuracy, recall, precision, F1 score, FNR, and FPR) for the four CNNs combined with the LSTM architecture. Figure 21 shows twelve confusion matrices, one for each of the 12 persons, obtained using the combination of EfficientNetB7 and the LSTM architecture. The numbers in the main diagonal are greater than the surrounding values. Figure 21 demonstrates the problem of recognizing digging as throwing and throwing as digging in some cases. Additionally, another recognition error appeared between throwing and waving in a few cases. While digging was recognized correctly in nine of 12 cases, throwing and walking were recognized correctly in only five cases. On the other hand, running was the most correctly recognized activity, as it was recognized correctly in most (11 of 12) cases.
The performance of CNN-based feature extraction models was also evaluated in terms of area under curve (AUC). EfficientNetB7+LSTM was found to outperform other CNNs, as it yielded an AUC of 94%. The second-highest AUC of 93% was achieved by ResNet50. Additionally, EfficientNetB4 obtained the third-highest AUC of 92%. On the other hand, the lowest AUC of 90% was obtained by EfficientNetB0.
Finally, the proposed method, which includes EfficientDetD7 for detection, EfficientNetB7 for feature extraction, and LSTM for classification, was compared with state-of-the-art methods that used the UCF-ARG dataset for human activity recognition, as shown in Table XIV. The comparison was done in terms of activity classification accuracy. Burghouts et al. [84] proposed the first HAR method that extracted motion features from videos and utilized these features for activity classification. This method yielded an accuracy of 57%. Additionally, Burghouts et al. [84] proposed a second method that utilized tracking and focus of attention for classification. This method yielded an accuracy of 75%. Furthermore, Hazar et al. [77] utilized optical flow stabilization to propose ROIs, in which humans were detected using AlexNet and activities were classified using GoogleNet. The method yielded an accuracy of 68%. Peng et al. [85] also targeted this dataset using speeded-up robust features (SURF) for stabilization, faster R-CNN for detection, and Inception-ResNet-3D for classification. This method yielded an accuracy of 73.72%. The proposed method was found to outperform these methods with an accuracy of 80%.
In summary, deep CNNs pre-trained on ImageNet 1K were used to transfer representations and features from ImageNet to aerial video frames. It was found that EfficientNetB7 representations are more informative when distinguishing between various human activities than other CNNs such as EfficientNetB0, EfficientNetB4, and ResNet50.

FIGURE 20. Performance metrics of various CNN-based feature extractors combined with LSTM
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and

IV. CONCLUSION AND FUTURE WORK
The work presented in this paper targeted two tasks: human detection (HD) and human activity recognition (HAR). The publicly available UCF-ARG aerial dataset was used to evaluate the performance of the proposed HDAR system. In this video dataset, a moving camera attached to an aerial platform was utilized to capture aerial video sequences. This dataset has highly challenging content with dynamical events such as varied altitudes, illumination changes, camera jitter, and variations in viewpoints, object sizes and colors. Various human object detectors pre-trained on the COCO dataset, such as YOLO, faster R-CNN, and EfficientDet, were evaluated to select the best detector that can detect humans and localize them inside the video frames. Several experiments were conducted to compare these human detectors. Additionally, various versions of EfficientDet including D0, D4, and D7 were compared. Furthermore, we demonstrated the capability of object detectors to detect humans performing various actions, such as digging, waving, throwing, walking, and running. Moreover, we applied various effects to the video frames by flipping horizontally, blurring, adding Gaussian noise, lightening, darkening, and converting RGB to grayscale in order to validate the robustness of the object detectors. The objective of human detection was to detect and crop human patches (ROIs) from video frames. It was found that EfficientDetD7 outperformed other detectors with an average detection accuracy of 92.9%. This research introduced new challenges to the UCF-ARG aerial dataset by adding various distortions such as blur, noise, and illumination changes. The performance of three human detectors in these poor frame conditions was evaluated. Our evaluation showed that the performance of the faster R-CNN human detector is degraded when these distortions are added. On the other hand, it showed that EfficientDet was robust against these distortions and can detect humans in all conditions included in the evaluation.
Furthermore, several experiments were carried out to compare various deep pre-trained CNNs, such as ResNet50, EfficientNetB0, EfficientNetB4, and EfficientNetB7, which were used to extract spatial features. The extracted features were utilized by LSTM to consider temporal relations between features for human activity classification. Experimental results showed that EfficientNetB7-LSTM was able to outperform other CNNs in terms of average accuracy (80%), average precision (83%), average recall (80%), average F1 score (80%), average false negative rate, and average false positive rate (4.8%).
In summary, a combination of EfficientDetD7 for human detection, EfficientNetB7 for feature extraction, and LSTM for time series classification was proposed to develop a novel HAR system with good performance. One limitation of the proposed HAR system was its poor ability to distinguish between throwing and digging. Moreover, the dataset was small with only 240 videos for five activities. Furthermore, the current HAR system utilizes only features extracted from video frames using parameters learned on the ImageNet dataset. In other words, all layers of the CNNs were frozen except the top layers, which were replaced with LSTM. Hence, in the future, we plan to improve the performance of our proposed method by fine-tuning all layers of the CNNs with aerial video frames to enhance accuracy. Lastly, more recently developed deep learning models, such as the vision transformer [113], may be good candidates for the enhancement of recognition performance.