FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification

Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies that are important for describing complicated dynamics. In this article, we propose a novel deep neural network, termed Fusing Temporal relations and Holistic features for aerial video classification (FuTH-Net), to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multiscale temporal relations in a novel fusion module to yield more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: 1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations and 2) a temporal relation pathway to capture multiscale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterward, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.


Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, Xiao Xiang Zhu

This work has been accepted by IEEE TGRS for publication.

I. INTRODUCTION
By virtue of their low-cost, real-time, and high-resolution data acquisition capacity, unmanned aerial vehicles (UAVs) can be exploited for a wide range of applications [1]-[17] in the field of remote sensing, such as object tracking and surveillance [5]-[10], traffic flow monitoring [11]-[14], and precision agriculture [15]-[17]. With the proliferation of UAVs worldwide, the number of aerial videos produced is increasing significantly. Hence, there is an escalating demand for automatically parsing aerial videos, because it is unrealistic for humans to screen such big data and understand their contents. Therefore, aerial video classification has become an important task in aerial video interpretation [18].
Feature learning and representation from videos are crucial for this task. Convolutional neural networks (CNNs) have demonstrated a superb capability for learning effective visual representations from images. For instance, ResNet [19] has achieved an impressive performance on the ImageNet dataset, which is even better than the reported human-level performance [20]. Compared to a sequence of remote sensing images, in which the temporal information is limited due to relatively long satellite revisit periods, an overhead video is able to deliver more fine-grained temporal dynamics that are essential for describing complex events. Therefore, in moving from image recognition to video classification, much effort has been devoted to learning spatiotemporal feature representations.
On the one hand, several methods [21]-[38] aim at learning a global spatiotemporal feature representation that can holistically represent a video. A straightforward idea is to extract features from each video frame individually with 2D convolutions and then pool the stacked feature maps across the temporal domain [21]. However, this neglects temporal relations among frames. To address this, [22] and [23] employ recurrent neural networks (RNNs) such as long short-term memory (LSTM) [39] to model temporal relations by integrating features over time. But the effectiveness of such methods usually depends heavily on how well long-term memorization is learned. Furthermore, 3D CNNs are fairly natural models for video representation learning and are able to learn global spatiotemporal features by performing 3D convolutions in both spatial and temporal dimensions. Several 3D CNN architectures [31]-[38] have been investigated and have shown impressive performance. For instance, in [31], the authors propose a 3D CNN model with 3×3×3 convolution filters for learning a video representation on a large-scale video dataset. Nonetheless, massive computational cost and memory demand hinder efforts to train a very deep 3D CNN and limit the performance of 3D CNN architectures. To address this problem, inflated 3D convolution filters [35] and decomposed 3D convolution filters [36], [37] offer a more economical way to implement 3D convolutions and boost the performance of 3D CNNs. However, the aforementioned methods with either 2D or 3D convolutions have limited temporal receptive fields and therefore cannot adequately capture variable temporal dependencies.
On the other hand, a few recent works attempt to explicitly model temporal relationships and demonstrate promising results in several tasks, including temporal relational reasoning [40]-[44], object detection and tracking [6]-[9], event recognition [45]-[47], video segmentation [48]-[50], dynamic texture recognition [51], and spatiotemporal learning [52], [53].
A video delivers not only spatial information but also temporal dynamics. Hence, some studies are dedicated to capturing spatial (appearance) and temporal (motion) representations separately with a two-stream architecture. In these two-stream models, fusing the features from the two pathways is an important step for recognition. For example, [24] directly fuses the softmax scores using either averaging or a simple linear SVM. In [54], the authors utilize a fully connected layer to merge the two streams of the late fusion model. However, its performance is surpassed by a purely spatial network. Additionally, [55] introduces residual connections between appearance and temporal streams to enable motion interactions. For stream fusion, the authors average the prediction scores of the classification layers from the two streams. In [30], the authors investigate several fusion methods, such as max, concatenation, and convolution, and observe that 3D convolutional fusion outperforms averaging the softmax outputs. The main limitation of the two-stream architecture is that it is not capable of spatiotemporally matching spatial and temporal information. Therefore, a fusion method is needed to spatiotemporally register the features from the two pathways. However, the abovementioned fusion methods rely on a single operation (e.g., averaging) that cannot effectively enable spatiotemporal interactions between them.
The motion in aerial videos usually has different durations and shows high variability. For example, in the ERA dataset [18], mudslide shows a simple and repeated motion over a long duration, which could be described by a few video frames; car racing depicts a complicated, dynamic process and is composed of a variety of consecutive motions, including chasing, approaching, moving away, colliding, etc., over a short duration. Temporal relations across multiple frames are an important cue for representing such complex motion. The aforementioned approaches based on spatiotemporal convolutions (e.g., 3×3×3 convolutions) simply add a temporal dimension to 2D convolution filters to implicitly learn temporal dependencies, and they cannot adapt to various, complicated temporal dynamics over a long duration due to their limited temporal receptive fields. To address this issue, we propose to explicitly learn temporal relations across arbitrary frames to effectively model long-term temporal dependencies. Furthermore, we introduce multi-scale temporal relations into holistic features to design a two-pathway architecture for aerial video classification. Besides, to spatiotemporally register temporal relations and holistic features, we propose a novel fusion module in which holistic features are spatiotemporally modulated with temporal relations.
In this paper, we present a two-pathway network, termed FuTH-Net (Fusing Temporal relations and Holistic features for aerial video classification). One pathway is devised to capture a holistic feature describing appearances and short-term temporal variations. The other pathway is responsible for excavating temporal relations across arbitrary frames at multiple timescales, providing long-term temporal dependencies. Last but not least, to spatiotemporally fuse the two features from the two pathways, we further present a novel fusion module in which the multi-scale temporal relations are leveraged to refine the temporal features in the holistic representation. More specifically, we learn the holistic feature by treating a video as an entirety and using inflated 3D convolution operators [35]. Meanwhile, we sample frame-level feature vectors at different sampling rates to learn multi-scale temporal relations with a sequence of multilayer perceptrons (MLPs) [56]. As to the fusion of these two features, we employ a fusion module in which the temporal relations are modulated with the holistic representation by a normalization-like process [57], [58]. The resulting feature representation is then fed into the following layers for the purpose of video classification.

Contributions of this paper are threefold:
• We propose a novel network, namely FuTH-Net, for the task of aerial video classification. This network exploits a two-pathway architecture, one pathway for learning a video representation holistically and the other for fully excavating useful temporal relations at multiple timescales among video frames.
• A novel fusion module exploits a normalization-like pipeline in which the two features learned from the two pathways are spatiotemporally registered by modulating the holistic features according to temporal relations. In this module, the temporal information in holistic features is refined by multi-scale temporal relations. A more discriminative fused feature is obtained for distinguishing different video events.
• We evaluate the effectiveness of the proposed network through extensive experiments, and experimental results show that our method achieves state-of-the-art performance.

The remaining sections of this paper are organized as follows. Section II details the architecture of FuTH-Net, and Section III shows and discusses experimental results. The conclusion is drawn in Section IV.

II. NETWORK ARCHITECTURE
In this section, we detail our proposed network architecture, FuTH-Net, for aerial video classification. First, we give an overview of the proposed network in Section II-A. Then, we describe two modules, the temporal relation block and the fusion module, in more detail in Sections II-B and II-C. Finally, the implementation of our network is introduced in Section II-D.

Fig. 1 (caption, partially recovered). (2) The lower pathway, namely the temporal relation pathway, aims to learn a multi-scale temporal relation bank l by a temporal relation block. (3) A subsequent fusion module combines the outputs of the two pathways to generate a robust fused feature z, which is finally fed into a fully-connected layer for aerial video classification.

The motivation of our network is to simultaneously model the holistic feature and temporal relations of a video with a two-pathway architecture. The resulting two feature representations are integrated by a fusion module. The overview of the architecture is illustrated in Fig. 1.

A. FuTH-Net
Holistic representation pathway treats a video as an entity and aims at learning a holistic feature by 3D convolutions. 3D convolution is achieved by endowing 2D convolution with an additional dimension (e.g., the temporal dimension of aerial videos), as illustrated in Fig. 2. Compared to 2D convolution, 3D convolution is able to capture both spatial and temporal information, the so-called holistic representation in our case. This is important for video classification in circumstances where events with simple temporal dynamics are strongly associated with certain objects or scenes. As to the implementation of 3D convolutions, many efforts, e.g., the 3D convolutional kernel [31], inflated 3D convolution [35], and pseudo 3D convolution [36], have been made to symmetrically extract both spatial and short-term temporal information. In this work, we choose a typical 2D CNN architecture and transform all 2D operations into 3D operations with a specific 3D implementation method [35]. Then, we apply the transformed 3D CNN, with a series of 3D convolution and pooling operations, to a video volume to capture a holistic representation g.

Temporal relation pathway views a video as a sequence of frames and aims to capture temporal relations across multiple frames with a temporal relation block. Temporal relation information is vital for video classification, as it captures high-level interactions among entities (subjects, objects, scenes, etc.) over a long temporal series, which are significant for recognizing events with complex temporal dynamics. To take advantage of this cue, we apply a 2D CNN to video frames to extract appearance features. Then, these features are fed into the temporal relation block to learn a multi-scale temporal relation bank l across arbitrary frames.
Fusion module combines the outputs of the two pathways to build a more discriminative representation. More specifically, it leverages a normalization-like pipeline in which the temporal relations are transformed into two modulation parameters by two affine transformations, and the produced parameters $F_1(l)$ and $F_2(l)$ are multiplied with and added to the holistic feature g element-wise to yield the normalized activation. Finally, the fused feature z is obtained by concatenating the normalized activation with an additional holistic feature g.
In what follows, we detail the temporal relation block and fusion module.

B. Temporal Relation Block
The purpose of temporal relational reasoning lies in linking meaningful transformations among entities over time. The work in [59] constructs a fully connected graph among entities in video frames and calculates pairwise energy functions among node pairs in the graph to model temporal relations. Inspired by this work, we aim at capturing temporal relations among arbitrary frames. Instead of utilizing a fully connected graph among video frames, which inevitably increases computation and redundancy, we make use of a sampling strategy to sample multiple snippets and learn relational representations using a group of multilayer perceptrons (MLPs). Note that each sampled snippet contains a variable number of frames for the purpose of learning multi-scale relational representations.
Formally, suppose that we have extracted an appearance feature set $V = \{f_1, f_2, \ldots, f_N\}$ of video frames by a 2D CNN, where $f_i$ denotes the 256-dimensional feature vector of the i-th video frame, and $N$ is the number of frames. We randomly sample $m$ vectors from $V$ and concatenate them into $s_m$, where $m$ is the number of sampled frames, so the length of the vector $s_m$ is $m \times 256$. Notably, before concatenation, we rearrange the sampled vectors according to their original temporal order. The corresponding $m$-frame relation function $\phi_m$ takes the concatenated vector $s_m$ as input; its form and the assembly of the relation bank l are given in Section II-D.

The temporal relation block is a basic computational unit with an input feature set $V$ and an output temporal relation bank l, and can be easily plugged into any classification CNN model. Fig. 3 illustrates the structure of our temporal relation block.
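The sampling and concatenation steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the helper names (`sample_snippet`, `make_mlp`, `relation_bank`), the toy sizes, and the random weights are hypothetical, batch normalization is omitted, and using one relation per scale m = 2, ..., N is an assumption inferred from the (N − 1) relations mentioned in Section II-D.

```python
import random


def sample_snippet(features, m, rng):
    """Pick m frame features at random, restore temporal order, concatenate."""
    idx = sorted(rng.sample(range(len(features)), m))  # keep original frame order
    s_m = [x for i in idx for x in features[i]]        # length m * feature_dim
    return idx, s_m


def make_mlp(in_dim, hidden, out_dim, rng):
    """Two-layer perceptron with ReLU; random weights stand in for learned ones."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(out_dim)]

    def forward(x):
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [max(0.0, sum(w * v for w, v in zip(row, h))) for row in w2]

    return forward


def relation_bank(features, out_dim, rng):
    """Concatenate one relation vector per scale m = 2..N (assumed scales)."""
    n, d = len(features), len(features[0])
    bank = []
    for m in range(2, n + 1):
        _, s_m = sample_snippet(features, m, rng)
        phi_m = make_mlp(m * d, out_dim, out_dim, rng)  # one MLP per scale
        bank.extend(phi_m(s_m))
    return bank


rng = random.Random(0)
N, D, R = 6, 4, 3  # toy sizes; the paper uses 16 frames and 256 MLP units
frames = [[rng.random() for _ in range(D)] for _ in range(N)]
l = relation_bank(frames, R, rng)
print(len(l))  # (N - 1) * R = 15
```

Sorting the sampled indices is what preserves the temporal order before concatenation; without it, the same set of frames could produce differently ordered inputs to the MLP.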

C. Fusion Module
Outputs from the holistic representation pathway and the temporal relation pathway are integrated by a fusion module that encodes spatiotemporal correspondences between holistic features and temporal relations. Spatiotemporally registering the two features is vital for encoding these correspondences. Motivated by conditional normalization [57], [58], we present a novel fusion module where the two features are spatiotemporally registered by modulating the holistic features according to temporal relations. The multi-scale temporal information is leveraged to refine the temporal representations in holistic features. Specifically, the module utilizes a normalization-like pipeline in which the temporal relations are transformed into two modulation parameters by two affine transformations, and the produced parameters $F_1(l)$ and $F_2(l)$ are multiplied with and added to g element-wise to yield the normalized activation. Finally, the fused feature z is obtained by concatenating the normalized activation with an additional holistic feature g. The fusion equation is as follows:

$$z = [F_1(l) \odot g + F_2(l),\; g],$$

where $F_1$ and $F_2$ are affine transformations that produce the modulation parameters, $\odot$ denotes the Hadamard product, and $[\cdot, \cdot]$ denotes concatenation. The overall structure of the fusion module is illustrated in Fig. 4. We concatenate an additional holistic feature g with the modulated feature to yield the final fused feature z. This enriches the spatial information that is important for distinguishing events with simple dynamics. To validate its effectiveness, we further compare it with several existing fusion methods in the ablation study (see Section III-B).

Fig. 4. Two affine transformations are applied to the temporal relation bank l to produce two vectors, $F_1(l)$ and $F_2(l)$, respectively. Afterwards, the Hadamard product and addition operation are applied to them with g. Finally, the output vector is concatenated with g to yield the fused feature z.
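To make the modulation concrete, the following toy sketch evaluates the fusion step for hand-picked numbers. It is an illustration under stated assumptions: `affine` and `fuse` are hypothetical helper names, each transformation is a single affine map (the paper implements them as MLPs with dropout), and the tiny dimensions and weights are chosen only so that the result can be checked by hand.

```python
def affine(weights, bias, x):
    """Apply y = W x + b with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(g, l, f1, f2):
    """Modulate g with parameters derived from l, then concatenate with g."""
    scale = affine(f1[0], f1[1], l)  # F_1(l), multiplicative parameter
    shift = affine(f2[0], f2[1], l)  # F_2(l), additive parameter
    modulated = [s * x + t for s, x, t in zip(scale, g, shift)]
    return modulated + g             # list concatenation: [modulated, g]


# Toy inputs: a 2-dimensional holistic feature and a 3-dimensional relation bank.
g = [1.0, 2.0]
l = [1.0, 0.0, -1.0]
F1 = ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])  # yields scale = [1, -1]
F2 = ([[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]], [0.5, 0.5])  # yields shift = [0.5, 0.5]

z = fuse(g, l, F1, F2)
print(z)  # [1.5, -1.5, 1.0, 2.0]
```

The output has twice the dimension of g, which matches the 1024-dimensional g and 2048-dimensional fused feature reported in Section II-D.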

D. Implementation Details
In this subsection, we describe the implementation of our FuTH-Net.
Holistic representation pathway. We convert a typical image classification architecture, Inception-v1 [60], into a 3D architecture by inflating all convolution and pooling filters. The 3D convolutions are created by endowing 2D ones with an additional temporal dimension. Furthermore, we would like to bootstrap the 2D network weights pretrained on ImageNet into the 3D model. To achieve this, the 3D model can be implicitly pretrained on ImageNet by converting images into fixed videos. We replicate the weights of 2D convolutions N times along the temporal dimension and then divide them by N to produce pretrained parameters for the 3D model. Moreover, we tune the hyperparameters of the convolution and pooling operations (e.g., stride and pooling size) to effectively capture representative temporal dynamics. In detail, we use 1 × 3 × 3 kernels with 1 × 2 × 2 strides in the first two max pooling layers to retain the initial temporal information. The final average pooling layer uses a 2 × 7 × 7 kernel to produce a 1024-dimension feature vector, which is regarded as the holistic representation g.

Temporal relation pathway. We utilize Inception-v1 with batch normalization pretrained on ImageNet as our feature extraction model to generate a 1024-dimension feature vector for each frame. Subsequently, a feature bank of size n × 1024 is produced for an input video, where n is the number of input video frames. Moreover, φ_m is a two-layer MLP with 256 units, and each layer is followed by a batch normalization [57] layer and a ReLU activation function. (N − 1) temporal relations are extracted by φ_m and then concatenated into the final multi-scale temporal relation bank with a dimension of 256 × (N − 1). The number of input frames is set to 16 in both pathways.
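The weight inflation described for the holistic representation pathway (replicate N times along time, then divide by N) can be checked numerically on a single filter position. The sketch below is illustrative only: the function names and the toy 3 × 3 kernel and patch are hypothetical, and it evaluates one filter response rather than a full convolution. It shows that an inflated kernel applied to a "fixed video" (the same image repeated over time) gives the same response as the original 2D kernel on that image.

```python
def inflate_kernel(kernel_2d, t):
    """Repeat a 2D kernel t times along time and rescale each copy by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def response_2d(kernel, patch):
    """Filter response at one location: sum of element-wise products."""
    return sum(k * p for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))


def response_3d(kernel, volume):
    """Same as response_2d, summed over the temporal slices."""
    return sum(response_2d(k2, p2) for k2, p2 in zip(kernel, volume))


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]
T = 4

static_video = [patch for _ in range(T)]  # the same image at every time step
kernel_3d = inflate_kernel(kernel, T)

r2 = response_2d(kernel, patch)
r3 = response_3d(kernel_3d, static_video)
print(r2, r3)  # both equal -20.0
```

Dividing by the temporal extent is what keeps the activations of the inflated network on a static clip at the same scale as those of the pretrained 2D network.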
Fusion module. Two simple MLPs with dropout operations are used to implement the two affine transformations, which are applied to l to yield two 1024-dimension vectors, $F_1(l)$ and $F_2(l)$. The final fused feature is a 2048-dimension vector.
Training schedule. The network is implemented in the PyTorch framework (https://pytorch.org/) and runs on one NVIDIA Tesla P100 GPU (https://www.nvidia.com/en-us/data-center/tesla-p100/) with 16 GB of on-board memory. We train our model with a stochastic gradient descent (SGD) [61] optimizer using a momentum of 0.9 and a weight decay of 0.0005. Due to the limitation of GPU memory, we utilize a multi-stage training strategy. Specifically, the whole training procedure is composed of three phases. First, we train the holistic representation pathway for 100 epochs with a batch size of 6 and a learning rate of 0.001. Then, we train the temporal relation pathway with a learning rate of 0.0001 and the same number of epochs and batch size, while keeping the weights of the holistic representation pathway fixed. Finally, the fusion module is trained for 120 epochs with the weights of the two pathways fixed.
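The three-phase schedule can be summarized as a small configuration structure. This is a sketch for orientation only: the `Phase` class and the module names are hypothetical and do not come from the released code, and the batch size and learning rate of the final phase are left as `None` because the text does not state them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Phase:
    trainable: str                  # module whose weights are updated
    frozen: Tuple[str, ...]         # modules kept fixed during this phase
    epochs: int
    batch_size: Optional[int]       # None = not stated in the text
    learning_rate: Optional[float]  # None = not stated in the text


# SGD settings shared by all phases, as given in the text.
SGD_SETTINGS = {"momentum": 0.9, "weight_decay": 0.0005}

SCHEDULE = (
    Phase("holistic_pathway", (), 100, 6, 0.001),
    Phase("temporal_relation_pathway", ("holistic_pathway",), 100, 6, 0.0001),
    Phase("fusion_module",
          ("holistic_pathway", "temporal_relation_pathway"), 120, None, None),
)

for number, phase in enumerate(SCHEDULE, start=1):
    print(number, phase.trainable, phase.epochs, phase.frozen)

total_epochs = sum(phase.epochs for phase in SCHEDULE)
print(total_epochs)  # 320
```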

III. EXPERIMENTS
In this section, we first introduce aerial video recognition datasets, competitors, and evaluation metrics in Section III-A. Then, we perform ablation studies to investigate the complementarity between the holistic representation pathway and temporal relation pathway as well as the effectiveness of our fusion module in Section III-B. Furthermore, we assess the performance of our FuTH-Net on two different aerial video recognition datasets, ERA and Drone-Action, and analyze experimental results in Section III-C and III-D, respectively.

A. Experimental Setup
Datasets. To evaluate the performance of FuTH-Net, we conduct experiments on two aerial video recognition datasets with standard evaluation protocols. Firstly, we use the ERA dataset [18], which is an event recognition dataset and consists of 2864 aerial event videos collected from YouTube. In this dataset, 25 events are defined, including post-earthquake, flood, fire, landslide, mudslide, traffic collision, traffic congestion, harvesting, ploughing, constructing, police chase, conflict, baseball, basketball, boating, cycling, running, soccer, swimming, car racing, party, concert, parade/protest, religious activity, and non-event (see Fig. 5). Then, the Drone-Action dataset [62] for human action classification in aerial videos is utilized to further assess the performance of models. In this dataset, 240 self-taken aerial videos are collected, and 13 different actions are defined: kicking, walking front/back, running side, jogging side, walking side, hitting stick, running front/back, stabbing, jogging front/back, clapping, hitting bottle, boxing, and waving hands (see Fig. 6). Table I exhibits details of the two datasets.

Notes to Table II (partially recovered): SlowFast† is trained from random initialization, without using pre-training. Multigrid† uses ImageNet-pre-trained weights for 3D convolutions inflated from 2D convolutions, following common practice.
In the preprocessing phase, we transform video clips of the Drone-Action dataset into the same data structure as the ERA dataset. Since durations of videos in the Drone-Action dataset range from 5 to 21 seconds, we cut them to 5-second clips. Afterwards, each frame is cropped and resized to a size of 640 × 640. For both datasets, we sample 16 frames from each video clip with a fixed sampling rate.
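As a sketch of the fixed-rate sampling step, the helper below picks 16 evenly spaced frame indices from a clip. The function name and the striding rule are assumptions, since the text does not specify how the fixed rate is derived; the 30 fps figure in the example is likewise only an assumption used to give a 5-second clip a concrete length.

```python
def fixed_rate_indices(num_frames, num_samples=16):
    """Return num_samples frame indices spaced by a constant stride."""
    if num_frames < num_samples:
        raise ValueError("clip is shorter than the number of requested samples")
    step = num_frames // num_samples  # constant stride between sampled frames
    return [i * step for i in range(num_samples)]


# Example: a 5-second clip at an assumed 30 frames per second (150 frames).
indices = fixed_rate_indices(150)
print(indices)  # [0, 9, 18, ..., 135]
```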
Competitors. We compare the proposed network with several state-of-the-art video classification models.
• C3D [31]. C3D (3D convolutional network) aims to extract spatiotemporal features with 3D convolutional filters and pooling layers. Compared to conventional 2D CNNs, 3D convolutions and pooling operations in C3D can preserve the temporal information of input signals and model motion as well as appearance simultaneously. Moreover, the authors in [31] demonstrate that the optimal size of 3D convolutional filters is 3×3×3. In our experiments, we test two C3D networks (https://github.com/tqvinhcs/C3D-tensorflow) with pre-trained weights on the Sport1M dataset [21] and the UCF101 dataset [63] (see C3D† and C3D‡ in Table II), respectively.
• P3D ResNet [36]. P3D ResNet (pseudo-3D residual network) is composed of pseudo-3D convolutions, where conventional 3D convolutions are decoupled into 2D and 1D convolutions in order to learn spatial and temporal information separately. With such convolutions, the model size of a network can be significantly reduced, and the utilization of pre-trained 2D CNNs is feasible. Besides, inspired by the success of ResNet [19], P3D ResNet employs ResNet-like architectures to learn residuals in both spatial and temporal domains. In our experiments, we test two 199-layer P3D ResNet (https://github.com/zzy123abc/p3d) models (P3D-ResNet-199) with pre-trained weights on the Kinetics dataset [64] and the Kinetics-600 dataset [65] (see P3D†-ResNet-199 and P3D‡-ResNet-199 in Table II), respectively.
• I3D [35]. I3D (inflated 3D ConvNet) expands 2D convolution and pooling filters to 3D, which are then initialized with inflated pre-trained models. Particularly, weights of 2D networks pre-trained on the ImageNet dataset are replicated along the temporal dimension. With this design, not only 2D network architectures but also pre-trained 2D models can be efficiently employed to increase the learning efficiency and performance of 3D networks. To assess the performance of I3D on our dataset, we test two I3D models whose backbones are both Inception-v1 [60] (I3D-Inception-v1) with pre-trained weights on the Kinetics dataset [64] and Kinetics+ImageNet, respectively (see I3D†-Inception-v1 and I3D‡-Inception-v1 in Table II).
• TRN [43]. Temporal relation network (TRN) is proposed to recognize human actions by reasoning about multi-scale temporal relations among video frames. By leveraging the proposed plug-and-play relational reasoning module, TRN can even accurately predict human gestures and human-object interactions from sparsely sampled frames. For our experiments, we test TRNs with 16 multi-scale relations and select the Inception architecture as the backbone. Notably, we experiment with two variants of the Inception architecture: BNInception [66] and Inception-v3 [67]. We initialize the former with weights pre-trained on the Something-Something V2 dataset [68] (TRN†-BNInception in Table II) and the latter with weights pre-trained on the Moments in Time dataset [69] (TRN‡-Inception-v3 in Table II).
• SlowFast [46]. The SlowFast network is a two-pathway architecture in which a Slow pathway operates at a low frame rate to capture spatial semantic information, and a Fast pathway operates at a high frame rate to learn motion at fine temporal resolution. To assess the performance of SlowFast on our dataset, we test one SlowFast model (see SlowFast† in Table II).
• [...] Table II) with ImageNet-pre-training.

Evaluation metrics. We make use of the per-class precision, overall accuracy, confusion matrix, and kappa coefficient as evaluation metrics for comparing the performance of different models. Specifically, the per-class precision is calculated with the following equation:

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}.$$

The overall accuracy is computed by dividing the number of correctly classified test samples by the number of all test samples.
Moreover, the confusion matrix is visualized to illustrate the classification performance of different models. Each element of the matrix denotes the number of instances that belong to the ground-truth class (X-axis) but are classified as the predicted class (Y-axis). For a clearer visualization, we normalize the confusion matrix by dividing each element by the sum of its row. In addition, the kappa coefficient is leveraged to evaluate consistency and classification precision. It considers both the overall accuracy and the variations in the number of samples in each category.
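The four metrics can be computed from predicted and ground-truth labels as in the sketch below. It is a minimal illustration with hypothetical function names and toy labels. Two choices are assumptions rather than details from the text: the kappa coefficient is taken to be Cohen's kappa, and the matrix is stored with ground-truth classes as rows and predicted classes as columns.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples of ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_precision(cm):
    """TP / (TP + FP) per class; 0.0 if a class is never predicted."""
    k = len(cm)
    result = []
    for c in range(k):
        predicted = sum(cm[r][c] for r in range(k))  # TP + FP for class c
        result.append(cm[c][c] / predicted if predicted else 0.0)
    return result


def overall_accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """(p_o - p_e) / (1 - p_e), with p_e the agreement expected by chance."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    p_e = sum(sum(cm[i]) * sum(cm[r][i] for r in range(k))
              for i in range(k)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def row_normalize(cm):
    """Divide each element by the sum of its row (empty rows stay zero)."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in cm]


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)                       # [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
print(per_class_precision(cm))  # approximately [0.667, 0.667, 1.0]
print(overall_accuracy(cm))     # 0.75
print(round(cohen_kappa(cm), 4))  # 0.6279
```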

B. Ablation Studies
To evaluate the complementarity between two pathways and effectiveness of the fusion module, we conduct ablation studies on the ERA and Drone-Action datasets.
Complementarity. We investigate the complementarity by comparing our FuTH-Net with its single-pathway versions on the ERA dataset. Specifically, instead of simultaneously utilizing both pathways, Holistic-only and Relation-only make use of the holistic representation pathway and the temporal relation pathway, respectively. For a comprehensive study, we compare these models under various video sampling strategies. As shown in Fig. 7, we sample 4, 8, 12, 16, and 20 frames from each video clip and show overall accuracies. It can be observed that FuTH-Net performs better than the other two competitors under all sampling strategies. The combination of the two pathways brings significant improvements, demonstrating that the multi-scale temporal dependencies captured by the temporal relation pathway are largely complementary to the holistic feature. Moreover, we note that Holistic-only outperforms Relation-only when 4 or 8 frames are used, but is surpassed by Relation-only as the number of frames increases. The reason could be that a few frames are not enough for learning multi-scale temporal relations. Another interesting observation is that the performance of Holistic-only deteriorates when the number of sampled frames is larger than 12, which might result from information redundancy. This also has a negative effect on FuTH-Net and brings a decrement of 2.3% as the number of sampled frames increases from 16 to 20. Overall, FuTH-Net reaches its best performance at 16 frames.

In addition, we jointly leverage holistic spatiotemporal features and multi-scale temporal relations for video classification. To validate the effectiveness of this combination, we compare our model with other hybrid models (i.e., C3D+TRN, P3D+TRN, and I3D+TRN) on the two datasets using two fusion methods, concatenation and our fusion module. The numerical results are reported in Table III. We can observe that, compared to the other hybrid models, our FuTH-Net achieves the best performance with both fusion methods on the two datasets. Moreover, we note that hybrid models with our fusion module generally outperform those with concatenation. Another interesting observation is that the three hybrid models do not achieve better performance than single models (i.e., TRN). For example, I3D+TRN with our fusion module achieves an OA of 60.8%, while TRN‡-Inception-v3 obtains an OA of 64.3%.
Fusion module. As an important component of our framework, the fusion module integrates features from both pathways. To validate its effectiveness, we compare the fusion module with several commonly used integration operations, such as max, average, concatenation, bilinear, sum, 2D conv, and 3D conv. Notably, for the 2D and 3D convs, the input is the concatenation of the feature maps from the last convolutional layers of the two pathways. Table V compares FuTH-Net to models with different fusion methods on both the ERA and Drone-Action datasets. As can be seen in this table, FuTH-Net provides better results than models with other fusion methods and models with single pathways, which demonstrates that our fusion module can effectively encode high-level interactions between the two features and improve the performance. Moreover, we concatenate an additional holistic feature g with the modulated feature to yield the final fused feature z. To ablate this design, we replace the additional feature with two alternatives, i.e., None and the temporal relation l, and concatenate each with the modulated feature to obtain z. The numerical results are reported in Table VI. The model with the holistic feature g as the additional feature outperforms the other models. Richer spatial information introduced by the holistic feature can improve the discriminative ability for events with simple dynamics.
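To make the baseline integration operations concrete, the following sketch applies the parameter-free ones (max, average, sum, concatenation, bilinear) to a holistic feature g and a temporal relation feature l. It assumes both are plain vectors of equal length and uses the usual textbook definition of each operation; the exact formulations used for Table V are not given in the text, and the learned 2D/3D conv baselines and the proposed fusion module itself are deliberately left out. The function name `fuse` is hypothetical.

```python
def fuse(g, l, method):
    """Combine two equal-length feature vectors with a simple baseline operation.

    g: holistic feature, l: temporal relation feature (lists of floats).
    These are illustrative definitions, not the authors' implementation.
    """
    if len(g) != len(l):
        raise ValueError("g and l must have the same length")
    if method == "max":
        return [max(a, b) for a, b in zip(g, l)]
    if method == "average":
        return [(a + b) / 2 for a, b in zip(g, l)]
    if method == "sum":
        return [a + b for a, b in zip(g, l)]
    if method == "concatenation":
        return list(g) + list(l)
    if method == "bilinear":
        # Flattened outer product: every pairwise product g[i] * l[j].
        return [a * b for a in g for b in l]
    raise ValueError(f"unknown fusion method: {method}")


if __name__ == "__main__":
    g = [1.0, 4.0]
    l = [3.0, 2.0]
    for name in ("max", "average", "sum", "concatenation", "bilinear"):
        print(name, fuse(g, l, name))
```

Max, average, and sum keep the feature dimension d, concatenation doubles it, and the bilinear variant grows it to d squared, so only the last one models pairwise interactions between the two features explicitly.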

C. Results on the ERA dataset
We compare the proposed FuTH-Net with other competitors on the ERA dataset and report numerical results in Table II. Our model provides an OA of 66.8%, which is 1.5% higher than that of the second best model, Multigrid‡. Our model and Multigrid achieve the same best kappa coefficient (0.63). In addition, the per-class precision is reported to evaluate the performance of different models on each class. In particular, our model achieves the highest per-class precisions for some challenging categories, such as concert (89.8%), car racing (84.2%), and parade/protest (65.3%). This is mainly because FuTH-Net is able to capture complex dynamic information, which is crucial for distinguishing events with small inter-class variance. Taking concert and parade/protest (cf. the first row of Fig. 8) as an example, they have elements in common (e.g., crowd and street). However, the temporal dynamics of crowds in these two events are very different (concert: moving randomly or standing still; parade/protest: moving towards a certain direction). FuTH-Net correctly predicts these two events. Table II also shows that our network achieves the highest precisions for these two classes, demonstrating its effectiveness for temporal relational reasoning.
Moreover, the performance on class non-event can reflect whether a model can distinguish specific events from normal videos. Notably, our model produces the best precision (63.9%) for non-event, which illustrates that our method is able to capture discriminative spatiotemporal features for inferring the existence of events.
Finally, the confusion matrix in Fig. 9(a) shows more details. We can observe that some events involving similar objects and scenes (e.g., "landslide vs. mudslide"; "traffic collision vs. police chase"; "harvesting vs. ploughing"; "concert vs. party") tend to be misclassified. Other competitors also suffer from this problem. Fig. 8 shows some predictions of FuTH-Net and the second best model (i.e., Multigrid). It can be observed that many visual similarities exist in the textures, objects, and scenes of these events.
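The three quantities reported in this subsection (OA, kappa coefficient, and per-class precision) can all be derived from a confusion matrix such as the one in Fig. 9(a). The sketch below assumes the standard definitions (OA as the fraction of correct predictions, Cohen's kappa, and precision as correct predictions divided by all predictions of a class) and assumes that rows index true classes and columns index predicted classes; the text does not state either convention. The function name `classification_metrics` and the 2x2 example matrix are hypothetical.

```python
def classification_metrics(confusion):
    """Compute OA, Cohen's kappa, and per-class precision.

    Assumption: confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total

    row_sums = [sum(row) for row in confusion]  # samples per true class
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # per predicted class

    # Agreement expected by chance, from the marginal distributions.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1 - expected) if expected < 1 else 0.0

    # Precision of class j: correct predictions of j over all predictions of j.
    precision = [
        confusion[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)
    ]
    return oa, kappa, precision


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]  # made-up counts, not results from the paper
    print(classification_metrics(toy))
```

Because kappa subtracts chance agreement, two models can share the same kappa while differing in OA, as is reported for FuTH-Net and Multigrid on ERA when kappa is rounded to two decimals.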

D. Results on the Drone-Action dataset
This subsection compares FuTH-Net with state-of-the-art methods on the Drone-Action dataset; quantitative results are reported in Table IV. FuTH-Net achieves the highest OA, 88.4%, which is 1.7% higher than that of SlowFast, the second best model. Moreover, our model achieves the best kappa coefficient (0.87).
Besides, it is interesting to note that FuTH-Net performs well in recognizing actions for which effectively sensing motion speed is crucial for a successful prediction. For instance, the proposed network achieves the highest precisions for walking side (100.0%), running side (58.8%), and jogging side (100.0%). To further illustrate this, we show some predictions of FuTH-Net and the second best model (i.e., SlowFast) in Fig. 10. As can be observed, the motion speeds of walking side and running side differ, and FuTH-Net succeeds in identifying them with high confidence. The bottom right example shows that running front/back is misclassified by both FuTH-Net and SlowFast, because human poses and motion speeds are very similar from this viewing angle. Furthermore, the confusion matrix of the proposed network on the Drone-Action dataset, shown in Fig. 9(b), also suggests that running front/back is easily misidentified as jogging front/back.

IV. CONCLUSION
In this paper, a novel method is proposed to learn feature representations from aerial videos using a two-pathway network, termed FuTH-Net. Specifically, the proposed network exploits inflated 3D convolutions to capture a holistic feature on a holistic representation pathway. Simultaneously, a temporal relation block learns temporal relations across multiple frames on a temporal relation pathway. A novel fusion module fuses the outputs of the two pathways to produce a more discriminative video representation. Furthermore, we conduct extensive experiments on two aerial video recognition datasets, ERA and Drone-Action. On the one hand, we perform ablation studies to validate the complementarity between the two pathways as well as the effectiveness of the proposed fusion module. On the other hand, we compare our model with other state-of-the-art methods. Experimental results demonstrate that introducing the temporal relation pathway enhances the ability to capture representative temporal relations. Besides, our fusion module is capable of learning high-level interactions between the holistic features and temporal relations to further boost the performance. The strong performance on the two datasets further illustrates the capability of FuTH-Net for remote sensing video recognition and its generalization across different tasks (event classification and human action recognition).