Learning Long-Term Temporal Features With Deep Neural Networks for Human Action Recognition

Human action recognition is one of the challenging tasks in the field of artificial intelligence. In this paper, we propose a novel long-term temporal feature learning architecture for recognizing human actions in video, named Pseudo Recurrent Residual Neural Networks (P-RRNNs), which exploits the recurrent architecture and composes each variant with different connections among units. A two-stream CNN model (GoogLeNet) is employed to extract local temporal and spatial features, respectively. The local spatial and temporal features are then integrated into global long-term temporal features by the proposed two-stream P-RRNNs. Finally, a Softmax layer fuses the outputs of the two-stream P-RRNNs for action recognition. Experimental results on two standard databases, UCF101 and HMDB51, demonstrate the outstanding performance of the proposed architectures for human action recognition.


I. INTRODUCTION
Human action recognition in video is an important and actively studied research topic with various useful applications, such as intelligent video surveillance, video retrieval, human-computer interaction and smart home appliances [1]-[3]. Action recognition remains difficult because of background clutter, lighting conditions, partial occlusion and viewpoint change [4]. As in other vision problems, effective visual features of human action in video are crucial for action recognition [5], [6]. For example, Figure 1 shows two examples from the HMDB51 dataset [7]. The appearance feature in each frame is not sufficient to differentiate the classes.
Figure 1. Examples from the HMDB51 dataset [7]. The ground truth for the two samples is ''sit'' and ''stand''. Temporal information is crucial to correctly determine these two classes.
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
The feature representation of human action recognition can be roughly divided into two types. One is based on hand-crafted features, such as Histogram of Oriented Gradients (HOG) [8], Histogram of Optical Flow (HOF) [9] and Improved Dense Trajectories (IDTs) [10]. Because of their strong performance on image recognition, IDT features with Fisher Vector (FV) [11] or Vector of Locally Aggregated Descriptors (VLAD) [12] have been applied to action recognition in video [10], [13], [14] and have achieved state-of-the-art results. Richard and Gall [15] proposed an RNN-based encoding method [16] to aggregate local feature descriptors for action recognition. The other type is based on deep learning. As in many computer vision tasks, progress on human action recognition has been significantly advanced by deep learning techniques.
learning architecture for capturing video representation is also proposed by using LSTM model [48].
In this paper, we experimentally evaluate the proposed method, which learns richer semantic features and models longer-term temporal information in video for action recognition. Our main contributions can be summarized as follows. First, we introduce residual learning into the recurrent structure and propose Pseudo Recurrent Residual Neural Networks (P-RRNNs) to model long-term temporal features. Our model outperforms other RNN-based architectures in action recognition tasks, while the number of model parameters is greatly reduced. Second, unlike most approaches, which extract temporal features from sampled video frames, we directly learn spatial and long-term temporal features from the holistic video clip to depict human action information, which yields more robust visual features. Third, we combine our P-RRNN features with IDT features for action recognition. The experimental results demonstrate that the two kinds of features are complementary. Figure 2 summarizes the pseudo recurrent residual neural network architecture used in this study.
The paper is organized as follows. In Section II, we review the literature related to the presented work. In Section III, we elaborate on the details of the proposed P-RRNNs for action recognition in video. In Section IV, we analyze the performance of the approach. Finally, Section V concludes the work.

II. RELATED WORKS
Human action recognition has been studied by researchers for decades. Early studies [49], [50] mainly focused on simple actions with plain backgrounds, such as hand clapping, boxing and walking. With rapid developments in local feature extraction and encoding, action recognition has gradually moved towards practical applications [10], [13], [51], [52]. Depending on whether deep neural networks are used, reported methods can be categorized into hand-crafted feature-based methods and deep learning-based methods.

A. HAND-CRAFTED FEATURES FOR ACTION RECOGNITION
Early methods [53]-[55] for action recognition in videos primarily focused on hand-crafted features, which describe appearance and motion information by using a number of local features. Local features are an effective tool in image recognition; they represent an image through feature descriptors such as Scale-Invariant Feature Transform (SIFT) [56] and Speeded Up Robust Features (SURF) [57]. Inspired by the success of image recognition, researchers directly extended image classification methods to learn spatio-temporal information from video for action recognition. In [54], the HOG descriptor is extended to Histograms of Oriented 3D spatio-temporal Gradients (HOG3D) for action recognition in video. Inspired by the Harris corner detector [58], Harris3D [55] is proposed to encode the region of interest (ROI). Extended from SIFT, SIFT-3D [53] is proposed to represent spatio-temporal motion features for action recognition. More recently, one effective approach is Dense Trajectories (DT) [14], which consists of HOG, HOF, and Motion Boundary Histogram (MBH) descriptors. IDT features [10] take camera motion into account to improve the performance of DT-based action recognition. Furthermore, Carmona and Climent [59] combine IDTs and subtensor projections to depict human actions. However, extracting IDT features has much higher computational complexity and becomes intractable on large-scale data sets.
By adopting encoding methods such as Bag of Visual Words (BoVW) [60], FV, or VLAD, local descriptors can be embedded into a global video-level feature vector for action recognition. Compared with BoVW, FV and VLAD capture higher-order statistics of the local features and obtain noticeably higher accuracy. However, these encoding methods discard the temporal order of local features in the video. Meanwhile, graphical models, such as conditional random fields [61], are well-known approaches to extracting long-term temporal features for action recognition.
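To make the loss of temporal order concrete, the following minimal Python sketch builds a BoVW histogram by hard-assigning each local descriptor to its nearest codeword. The two-word codebook, the toy descriptors and the function names are hypothetical and chosen only for illustration; in practice the codebook is learned, for example with k-means, from a large set of descriptors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(descriptor: Vector, codebook: List[Vector]) -> int:
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bovw_histogram(descriptors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Count how many descriptors fall on each codeword (hard assignment)."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1
    return hist


# Hypothetical codebook and per-frame descriptors listed in temporal order.
codebook = [(0.0, 0.0), (1.0, 1.0)]
sit_down = [(0.9, 1.1), (0.8, 0.9), (0.1, 0.2), (0.0, 0.1)]
stand_up = list(reversed(sit_down))  # same frames, opposite temporal order

hist_sit = bovw_histogram(sit_down, codebook)
hist_stand = bovw_histogram(stand_up, codebook)
```

Both sequences yield the same histogram, so a classifier that sees only this encoding cannot tell the two temporal orders apart.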

B. DEEP LEARNING FOR ACTION RECOGNITION
Many advances in action recognition in video are inspired by successes in image classification [62], [63]. The breakthroughs in the image domain also rekindled the focus on deep learning for video recognition. CNN models play a significant role in image classification and achieve state-of-the-art results. Extending the 2D convolutional filter, Ji et al. [35] proposed 3D CNNs to capture video features along both the spatial and temporal dimensions. Karpathy et al. [32] built a large-scale action recognition dataset, Sports-1M, which consists of 1 million video clips belonging to 487 classes, and proposed various ways to fuse motion information into the CNN model. Tran et al. [37] applied convolutional 3D (C3D) networks to 16 consecutive frames to learn motion and appearance information for action recognition, and experimentally found that 3 × 3 × 3 convolutional kernels obtained the highest accuracy. Varol et al. [64] proposed Long-term Temporal Convolution (LTC) models to extend the temporal length of the inputs. Simonyan and Zisserman [29] first proposed a two-stream CNN architecture, which applies two networks to extract appearance and motion features from two information streams and fuses them by averaging or by a linear SVM. Yang et al. [65] and Shi et al. [66] proposed additional information sources to learn richer appearance and motion features for action recognition based on the two-stream framework. Duta et al. [67] proposed Spatio-Temporal VLAD (ST-VLAD) to integrate spatio-temporal features. Kar et al. [25] found that only a small subset of frames plays a crucial role in discriminating a human action class, and they proposed a temporal frame pooling method to select the spatio-temporal components relevant to the action. Wang et al. [28] proposed a temporal segment network (TSN) architecture that incorporates sparse temporal sampling and video-level supervision to learn better long-term temporal features.
An [68] used a restricted Boltzmann machine as a feature encoder to encode the spatial and temporal features, and applied an SVM classifier to recognize human action in video. Cherian et al. [69] explored generalized rank pooling (GRP) to preserve the temporal order of video frames and improve recognition performance. Choutas et al. [70] proposed the pose motion (PoTion) representation to recognize actions in video. However, the PoTion method needs to be combined with I3D [26] to obtain high recognition accuracy, and it is affected by the quality of the human pose estimator.
Recently, with the improvement of ResNet [21], [71] for image classification, Feichtenhofer et al. [72] proposed a spatio-temporal ResNet (ST-ResNet) that combines ResNet with the two-stream CNNs. To effectively learn spatio-temporal features, they apply a residual connection from the spatial stream to the temporal stream. Meanwhile, inspired by the success of recurrent neural networks in sequential information modeling [73]-[76], many researchers [42], [44], [45], [48], [77], [78] have proposed LSTM models for action recognition. Ng et al. [44] and Donahue et al. [45] extracted frame-level features of video by using a CNN model and trained an LSTM on the frame-level features for direct video-level prediction. Srivastava et al. [48] proposed an approach for learning sequence information in an unsupervised setting by using an LSTM architecture. To mitigate the overfitting problem, Yu et al. [42] proposed a single-layer LSTM framework for learning long-term motion features. To learn spatio-temporal information, Zhang et al. [79] proposed multi-level recurrent residual networks to produce complementary representations for action recognition. Their recurrent residual model also makes use of temporal skip connections.
Among these RNN-based approaches, the networks in [44], [45] and [79] are most closely related to ours. In [44] and [45], the deep learning model is built from CNNs and LSTMs, with the output of the CNNs fed into the LSTMs. In contrast, we introduce pseudo residual learning into RNNs, which allows a deeper model to be designed to learn richer semantic features for action recognition. In [79], three-stream multi-level recurrent residual networks are proposed for action recognition, where each stream consists of ResNets and a recurrent model. Our network instead fuses pseudo residual learning into the recurrent neural networks themselves to learn long-term temporal features.
Motivated by the above analysis, we propose a pseudo recurrent residual neural network, which adds connections that skip one or more hidden layers in RNNs. The model can learn richer action semantic features and long-term motion information to recognize actions in video.

III. METHODS
In this section, we describe the key components of the P-RRNNs, including the LSTM and GRU architectures and the pseudo recurrent residual network architectures.

A. LONG SHORT-TERM MEMORY
Figure 3 shows an LSTM memory block with one cell [80], where ⊗ represents multiplication and dashed lines represent the weights between the cell and the gates. All other connection weights within the block are fixed to one. The LSTM architecture uses a set of memory blocks, each of which contains one or more self-connected memory cells and three multiplicative units: the input, output and forget gates. The three gates provide continuous analogues of write, read and reset operations for the cells. The gate activation function is usually the logistic sigmoid. The cell input and output activation functions are the hyperbolic tangent or the logistic sigmoid. Thanks to this specialized memory architecture, LSTM is able to effectively tackle the vanishing and exploding gradient problem. Assuming that x_t is the input feature at time step t, the activations of all LSTM neurons in a layer are computed with the following notation: σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, i_t, f_t, o_t and c_t are the input gate, forget gate, output gate and memory cell activation vectors, respectively, b_i, b_f, b_o and b_c denote the bias terms, and W_{αβ} is the weight matrix between α and β; for example, W_{xi} is the weight matrix from the inputs x_t to the input gates i_t.
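For reference, a commonly used LSTM formulation with cell-to-gate (peephole) weights, which is consistent with the gates, biases and W_{αβ} convention described for the memory block, can be written as follows. This is the standard textbook form and is given here as an assumption rather than as the exact equations of this work; the hidden output h_t, the recurrent weights W_{h·} and the peephole weights W_{ci}, W_{cf} and W_{co} are names introduced for this sketch.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The additive update of c_t is what lets gradients flow across many time steps: when f_t is close to one, the cell state is carried forward almost unchanged.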

B. GATED RECURRENT UNIT (GRU)
The GRU architecture is a simplified variant of the LSTM architecture [80], in which the input and forget gates are coupled into an update gate [81]. Compared with the three gates of the LSTM architecture, the GRU reduces the gating signals to two. The GRU architecture is shown in Figure 4; it consists of an update gate z and a reset gate r. The update gate moderates the rate at which information from the previous time step is allowed to enter the current state. Conversely, the reset gate controls how much state information from the previous time step is ignored. At time step t, the forward propagation is computed with the following notation. The symbol ⊙ represents the element-wise multiplication of two matrices. W_o ∈ R^{h×y} is the weight matrix between the hidden layer and the output layer, where h and y are the numbers of nodes in the hidden layer and the output layer, respectively. W_z ∈ R^{x×h} is the connection weight matrix from the input layer to the update gate, where x is the dimension of the input feature vector. U_z ∈ R^{h×h} is the weight matrix between the hidden layer of the previous time step and the update gate. W_r ∈ R^{x×h} and U_r ∈ R^{h×h} are the connection weight matrices from the input layer and from the hidden layer of the previous time step to the reset gate, respectively. Likewise, W ∈ R^{x×h} and U ∈ R^{h×h} are the connection weight matrices from the input layer and from the hidden layer of the previous time step to the candidate state \tilde{h}.
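For reference, the commonly used GRU formulation can be written with input-to-gate matrices W_z, W_r and W in R^{x×h}, recurrent matrices U_z, U_r and U in R^{h×h}, and an output matrix W_o in R^{h×y}. The sketch below makes several assumptions that are not confirmed by the text: inputs and states are row vectors (so that x_t W_z has dimension h), biases are omitted, g is an unspecified output nonlinearity, and the update gate z_t weights the previous state, which matches the description of the update gate given for Figure 4.

```latex
\begin{aligned}
z_t &= \sigma\left(x_t W_z + h_{t-1} U_z\right) \\
r_t &= \sigma\left(x_t W_r + h_{t-1} U_r\right) \\
\tilde{h}_t &= \tanh\left(x_t W + \left(r_t \odot h_{t-1}\right) U\right) \\
h_t &= z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t \\
y_t &= g\left(h_t W_o\right)
\end{aligned}
```

Some libraries swap the roles of z_t and 1 - z_t in the fourth line; the two conventions are equivalent up to relabeling the gate.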

C. RESIDUAL NETWORKS
He et al. [21] proposed residual learning in CNN architectures for image classification. Let H(x) denote the desired mapping. The idea of ResNet is to express the mapping learned from one layer to another as H(x) = F(x) + x, where x is the original input and F(x) is the residual function. With the spatial skip connection, the input feature x is forwarded directly and added to the output of the next layer, so the layers only need to approximate the residual function F(x) = H(x) − x.
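The identity shortcut can be illustrated with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the fixed scaling used as the residual function F stands in for the learned layers of an actual ResNet block.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same dimension as x for an identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def toy_residual(x: Vector) -> Vector:
    """Hypothetical stand-in for learned layers: a fixed scaling by 0.5."""
    return [0.5 * v for v in x]


def zero_residual(x: Vector) -> Vector:
    """When F outputs zeros, the block reduces to the identity mapping."""
    return [0.0 for _ in x]


out = residual_block([1.0, -2.0, 4.0], toy_residual)
identity_out = residual_block([1.0, -2.0, 4.0], zero_residual)
```

The second call shows why the formulation eases optimization: a block can fall back to the identity simply by driving F towards zero.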

D. PSEUDO RECURRENT RESIDUAL NEURAL NETWORKS (P-RRNNs) ARCHITECTURE
Inspired by the success of ResNets [21] in image recognition tasks and by the strength of RNNs in processing sequential information, we design novel pseudo recurrent residual neural networks to learn spatio-temporal features for action recognition. More specifically, we integrate residual learning into the recurrent neural network and propose pseudo recurrent residual neural network architectures. In our study, we design three P-RRNN architecture variants, as shown in Figure 5.
Figure 5(a) illustrates an unfolded two-stream P-RRNN with skip connections over time. Here, the l-th hidden layer receives the feature maps of the input and of the immediately preceding layer, x_0 and x_{l−1}, as input, where [x_0, x_{l−1}] refers to the concatenation of the feature maps produced in layers 0 and l − 1, and l > 1. Figure 5(a) depicts the video frame feature stream and the optical flow feature stream. Blue circles indicate the input feature vectors of the two-stream RRNNs at time t. Orange circles indicate hidden layers; the number of hidden layers of each stream is set to 3, and the number of hidden units is set to 512 [46]. C represents the concatenation of two vectors. Action recognition is achieved by merging the outputs of the two streams with the Softmax layer. Such a concatenation of two vectors can be regarded as a pseudo residual connection, and the network architecture can be regarded as a pseudo recurrent residual neural network. We name this recognition network structure Input P-RRNNs (IP-RRNNs).
Figure 5(b) shows Cross-layer P-RRNNs (CP-RRNNs), in which the l-th hidden layer receives the feature maps of the last two preceding layers, x_{l−2} and x_{l−1}, as input, where [x_{l−1}, x_{l−2}] refers to the concatenation of the feature maps produced in layers l − 1 and l − 2, and l ≥ 2. Figure 5(c) shows Recurrent Residual Neural Networks (RRNNs). The residual connection is similar to that of deep residual CNNs, and ⊕ represents the addition of two feature vectors.
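The difference between the three connection patterns can be sketched in Python for a single time step. The code only assembles the input of each hidden layer and does not implement the recurrent units; the function names, the toy feature values, and the choice to add the outputs of two consecutive layers in the RRNN case are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def layer_input(variant: str, feats: List[Vector], l: int) -> Vector:
    """Input to hidden layer l (1-based) at one time step.

    feats[k] holds x_k: feats[0] is the input feature x_0 and feats[k] is the
    output of hidden layer k, for every k < l.
    """
    if l < 1 or l > len(feats):
        raise ValueError("layer l needs x_0 .. x_{l-1} to be available")
    if l == 1:
        return list(feats[0])              # the first layer only sees x_0
    if variant == "IP":                    # concatenation [x_0, x_{l-1}]
        return feats[0] + feats[l - 1]
    if variant == "CP":                    # concatenation [x_{l-1}, x_{l-2}]
        return feats[l - 1] + feats[l - 2]
    raise ValueError(f"unknown variant: {variant}")


def additive_shortcut(layer_output: Vector, shortcut: Vector) -> Vector:
    """RRNN-style residual connection: element-wise sum of two vectors."""
    if len(layer_output) != len(shortcut):
        raise ValueError("addition requires vectors of equal dimension")
    return [a + b for a, b in zip(layer_output, shortcut)]


# Toy features: x_0 (input), x_1 and x_2 (outputs of hidden layers 1 and 2).
feats = [[1.0, 2.0], [0.1, 0.3], [0.2, 0.4]]

ip_in = layer_input("IP", feats, 3)               # [x_0, x_2]
cp_in = layer_input("CP", feats, 3)               # [x_2, x_1]
rr_out = additive_shortcut(feats[2], feats[1])    # x_2 + x_1
```

Concatenation enlarges the input of the next layer and places no constraint on the dimensions of the two vectors, whereas addition keeps the dimension fixed but requires both vectors to have the same size.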

IV. EXPERIMENTS
In this section, we demonstrate the effectiveness of the pseudo recurrent residual neural network through extensive experiments on the HMDB51 and UCF101 datasets. First, we introduce the two challenging human action benchmark datasets and the network training in Section IV-A. We study the recurrent units of P-RRNNs in Section IV-B and the RNN architectures in Section IV-C. Next, in Section IV-D, we evaluate the impact of different residual architectures. Then, we show the effect of the multi-stream fusion architecture in Section IV-E and the computational costs in Section IV-F. Finally, we compare our method with the state of the art in Section IV-G.
We conduct experiments on Ubuntu 14.04 with an Intel Core i7-7700 CPU, 32 GB of memory, and an NVIDIA GTX Titan graphics card.

A. DATABASES AND IMPLEMENTATION DETAILS

1) DATABASES
The UCF101 database [47] contains 13320 video clips downloaded from YouTube, covering 101 human action classes. The video clips are temporally trimmed, with a fixed frame rate of 25 FPS and a resolution of 320×240. Each action class is divided into 25 groups, each containing 4 to 7 video clips. Following the literature, three train/test splits are used for action recognition on UCF101, which ensure that video clips from the same video are not used for both training and testing.
The HMDB51 database [7] contains 6766 videos divided into 51 human action categories, collected from a wide range of sources, from digitized movies to online videos such as YouTube. For evaluation purposes, we follow the constraint that video clips in the training and testing sets must come from different video files. Specifically, three distinct training and testing splits were generated from the database, each with 3570 clips in the training set and 1530 clips in the testing set.

2) IMPLEMENTATION DETAILS
Following TDD [82], we assign each video frame and video snippet the same label as the video it comes from. We implement our network in Caffe. To alleviate overfitting, we use a two-stage training strategy to train the proposed networks. In the first stage, we train the two-stream GoogLeNet models, initializing the GoogLeNet parameters of both streams with models pre-trained on ImageNet [41]. We fine-tune the networks by using Stochastic Gradient Descent (SGD) with a batch size of 128 and a momentum of 0.9. For the spatial stream, the input is a single RGB frame of size 224 × 224 × 3; the learning rate starts from 0.001, is multiplied by 0.1 every 2,000 iterations, and training stops at 10,000 iterations. For the temporal stream, the input is a 224 × 224 × 2L volume, where L is the number of stacked optical flow fields; the learning rate is initialized to 0.005, is reduced by a factor of 10 after 10,000 and 15,000 iterations, and training stops at 20,000 iterations. The dropout ratio is set to 0.5 [83] for both streams. We use the softmax loss for GoogLeNet training.
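The two learning-rate schedules of the first stage can be written down explicitly. The sketch below is a plain Python rendering with hypothetical function names. The schedules resemble the step and multistep policies of the Caffe solver, although the text does not state which solver settings were actually used, and treating a milestone as reached when the iteration count equals it is an assumption.

```python
from typing import Sequence


def step_lr(base_lr: float, gamma: float, step_size: int, iteration: int) -> float:
    """Multiply the rate by gamma once every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def multistep_lr(base_lr: float, gamma: float,
                 milestones: Sequence[int], iteration: int) -> float:
    """Multiply the rate by gamma each time a milestone has been reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


# Spatial stream: starts at 0.001, scaled by 0.1 every 2,000 iterations.
spatial_lr = [step_lr(0.001, 0.1, 2000, it) for it in (0, 2000, 4500, 9999)]

# Temporal stream: starts at 0.005, divided by 10 after 10,000 and 15,000 iterations.
temporal_lr = [multistep_lr(0.005, 0.1, (10000, 15000), it)
               for it in (0, 9999, 10000, 15000, 19999)]
```

Under these assumptions the spatial stream ends training at a rate of about 1e-7, while the temporal stream ends at 5e-5.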
At the second stage, we train the P-RRNNs with LSTM and GRU units from scratch for the temporal stream and the spatial stream. The initialization of the P-RRNNs is important. To choose a good setting, we experiment with Gaussian distributions with zero mean and several standard deviations σ = {0.1, 0.01, 0.001, 0.0001} on split 1 of UCF101. Table 1 reports the results of the IP-LSTM and IP-GRU architectures with different standard deviations. We observe that setting σ = 0.001 achieves better performance. Therefore, we initialize the weights of the P-RRNNs with LSTM and GRU from a Gaussian distribution with zero mean and 10^{−2} variance. The training parameters of both streams are the same. Specifically, the initial learning rate is set to 0.01 and is reduced by a factor of 10 after 2,000 and 5,000 iterations. The whole training procedure stops at 10,000 iterations. We use a dropout of 0.6 for both streams [84]. The cross-entropy loss is used for P-RRNNs. During training of the P-RRNNs, the two-stream GoogLeNet parameters are fixed.
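The initialization sweep can be illustrated with a short Python sketch that draws weight matrices from zero-mean Gaussians with the four candidate standard deviations. The helper name, the fixed seed and the 64 × 64 matrix size are hypothetical; the sketch only generates the initial weights and does not reproduce the training runs behind Table 1.

```python
import random
import statistics
from typing import Dict, List


def gaussian_init(rows: int, cols: int, std: float, seed: int = 0) -> List[List[float]]:
    """Weight matrix with entries drawn independently from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


candidate_stds = [0.1, 0.01, 0.001, 0.0001]  # values swept in the text

# Empirical standard deviation of each initialization, as a sanity check.
empirical: Dict[float, float] = {}
for std in candidate_stds:
    weights = gaussian_init(64, 64, std)  # 64 x 64 is an arbitrary toy size
    flat = [w for row in weights for w in row]
    empirical[std] = statistics.pstdev(flat)
```

In an actual sweep, each candidate would be used to initialize the network, which is then trained and evaluated on the validation split before the best value is kept.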
At test time, given a video, we feed it frame by frame to the P-RRNNs to obtain the class scores for the whole video. The class with the maximum score is taken as the prediction for the input video. For both databases, we use the same evaluation protocol: three distinct training and testing splits are provided by the organizers, and performance is measured by the mean recognition accuracy across the three splits.
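The evaluation protocol can be sketched as follows. The text does not specify how the per-frame outputs are aggregated into video-level scores, so averaging the class scores over time steps is an assumption of this sketch, as are the function names and the made-up numbers.

```python
from typing import Sequence


def video_prediction(step_scores: Sequence[Sequence[float]]) -> int:
    """Average per-time-step class scores and return the arg-max class index."""
    if not step_scores:
        raise ValueError("need at least one time step")
    num_steps = len(step_scores)
    num_classes = len(step_scores[0])
    mean_scores = [sum(s[c] for s in step_scores) / num_steps
                   for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean_scores[c])


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of videos whose predicted class matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def mean_split_accuracy(split_accuracies: Sequence[float]) -> float:
    """Mean recognition accuracy across the train/test splits."""
    return sum(split_accuracies) / len(split_accuracies)


# Toy example with two classes and three time steps (made-up scores).
pred = video_prediction([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
# Made-up per-split accuracies, only to show the averaging.
mean_acc = mean_split_accuracy([0.5, 0.75, 1.0])
```

Reporting the mean over the three official splits reduces the dependence of the result on any single partition of the data.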

B. EVALUATION ON RECURRENT UNITS IN P-RRNNs
Different types of recurrent units may significantly influence the complexity and performance of RNNs. The GRU was proposed recently and has become one of the most commonly used recurrent units in RNNs. Therefore, in this section we focus on two types of recurrent units: LSTM units and GRUs. We evaluate these recurrent units on the task of action recognition on UCF101. More specifically, we employ the IP-RRNNs architecture with 512 hidden units [46] and three layers. Table 2 compares the accuracy of our method with GRU and LSTM architectures. The results demonstrate that the LSTM units obtain higher performance, with a gain of 2.8% over GRU units on the UCF101 dataset. A possible reason is that the GRU model simplifies the network architecture and thereby reduces its feature learning ability for action recognition. Consequently, unless otherwise stated, we use the LSTM architecture to learn long-term action features in the remainder of this paper.

C. EVALUATION ON DIFFERENT NUMBER OF HIDDEN LAYERS OF P-RRNNs
In a deep neural network architecture, each layer can carry different recognition information, but overfitting easily occurs as the number of layers increases. In this section, we investigate networks with 1 to 5 hidden layers, each with 512 hidden units. Figure 6 reports the action recognition results of IP-RRNNs with different numbers of hidden layers. Accuracy increases with the number of hidden layers when there are fewer than 4 hidden layers. This indicates that, in these experiments, IP-RRNNs gain recognition accuracy from increased depth, and the 3-layer architecture achieves the best performance. The recognition accuracy degrades when the network has more than three layers, possibly because the model overfits the training dataset. Therefore, we use the 3-layer IP-RRNNs architecture in the remainder of this paper.

D. EVALUATION ON DIFFERENT P-RRNNs MODELS
In these experiments, we study the effect of different P-RRNNs models. We have already found that the network with LSTM units performs better than the GRU architecture for action recognition. In this set of experiments, we further explore the advantage of the LSTM architecture compared with GRU. Table 3 shows the accuracy on UCF101. The IP-RRNNs with LSTM (IP-LSTM) model obtains the highest average accuracy, 88.5%. The IP-LSTM model strengthens feature propagation to increase representational power for action recognition. Meanwhile, the results show that the recognition accuracy of LSTM is again higher than that of the GRU architecture, which further supports the view that the LSTM architecture is more suitable for human action recognition in video. Therefore, considering the performance of the model, we select the IP-LSTM architecture for action recognition in the remainder of this paper.

E. EVALUATION ON EFFECT OF HYBRID NETWORK MODELS
In this subsection, we analyze the benefit of fusing two different networks, combining the IDT model and the IP-LSTM model. The hybrid network uses hand-crafted features and deep learning features, respectively. Specifically, IDT features extend local features such as HOF, HOG, and MBH, which depict spatial and short-term motion-related features in video, while the IP-LSTM extracts long-term temporal features. Both hand-crafted and deep learning features may therefore play an important role in action recognition in video. We choose the average fusion method to obtain the class probabilities of each video.
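Average fusion can be sketched in a few lines of Python. The sketch assumes that both streams output normalized class probabilities over the same classes; the function names and the three-class toy values are hypothetical and are not results from the experiments.

```python
from typing import List, Sequence


def average_fusion(prob_a: Sequence[float], prob_b: Sequence[float]) -> List[float]:
    """Element-wise mean of the class probabilities of two streams."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both streams must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(prob_a, prob_b)]


def predict(probs: Sequence[float]) -> int:
    """Index of the class with the highest probability."""
    return max(range(len(probs)), key=lambda c: probs[c])


# Made-up probabilities for a three-class toy problem.
ip_lstm_probs = [0.5, 0.3, 0.2]
idt_probs = [0.1, 0.7, 0.2]

fused = average_fusion(ip_lstm_probs, idt_probs)
label = predict(fused)
```

In this toy case the IP-LSTM stream alone would choose class 0, while the fused prediction is class 1, which shows how a confident second stream can correct the first.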
First, we illustrate the accuracy of each action category on UCF101 in Figure 7. The accuracy of each category is used to analyze the performance of IP-LSTM and the impact on recognition accuracy of fusing it with the IDT features. On the UCF101 dataset, our IP-LSTM performs perfectly on most categories, such as ''Baby Crawling'' and ''Golf Swing''. The results also show that the improvement is obvious for most action classes, such as ''Apply Eye Makeup'', ''Archery'' and ''Baby Crawling''. On the other hand, for some categories the IP-LSTM model has slightly lower accuracy than the IDT stream, such as ''Hammer throw'', ''Juggling balls'' and ''Boxing speed bag''. We find that these action categories are characterized by fast movements, which facilitate the extraction of IDT features that effectively describe the actions.
Table 4 shows the average accuracy over all action categories on the UCF101 dataset obtained by fusing the outputs of different streams. We observe that the fusion of IP-LSTM and IDT significantly improves performance and obtains a recognition accuracy of 91.4%. The results show that the IP-LSTM stream and the IDT feature stream are complementary. Therefore, it is crucial to merge the two types of features for a successful action recognition system.
Finally, we illustrate the superiority of the pseudo recurrent residual architecture. LRCN [45] and the proposed method are evaluated on the UCF101 dataset, and the results are listed in Table 5. The results show that the pseudo recurrent residual architecture performs better than both LRCN and our model without the pseudo recurrent residual architecture. The pseudo recurrent residual model is thus useful for action recognition in video.

F. COMPUTATIONAL COSTS
In this part, we analyze the computational cost of our method. Our IP-RRNNs model consists of two-stream CNNs and two-stream pseudo recurrent residual neural networks. The calculation of optical flow is time-consuming: it takes around 60 milliseconds per frame with GPU acceleration. For training, the two-stream CNNs need around one day, and the two-stream IP-LSTMs need several hours on a Titan X GPU. In Table 6, we compare the runtime and accuracy of IP-LSTMs with other RNN-like action recognition methods. Our IP-LSTM reaches a speed of 23.2 fps. On UCF101, our method is not state-of-the-art in terms of accuracy, but it is computationally efficient.
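As a rough unit conversion of the two timing figures quoted above, the per-frame optical flow time can be expressed as a frame rate, and the reported frame rate as a per-frame time. The text does not state whether the 23.2 fps figure includes optical flow computation, so the two numbers are converted separately and are not combined here.

```latex
\begin{aligned}
\text{optical flow:}\quad & \frac{1000\ \text{ms/s}}{60\ \text{ms/frame}} \approx 16.7\ \text{frames/s} \\
\text{IP-LSTM:}\quad & \frac{1000\ \text{ms/s}}{23.2\ \text{frames/s}} \approx 43.1\ \text{ms/frame}
\end{aligned}
```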

G. COMPARISON WITH STATE-OF-THE-ART RESULTS
In this subsection, we compare our method against recently proposed and relevant state-of-the-art methods. Our proposed IP-LSTM model achieves better performance than most previous methods on the two datasets. The comparative results for the UCF101 and HMDB51 datasets are summarized in Table 7. Currently, the state-of-the-art results of 98.2% on UCF101 and 80.9% on HMDB51 are obtained in [6] and [62], which are CNN-based models pretrained on Kinetics [86]. Our method is pretrained on ImageNet [41] and is somewhat inferior in recognition accuracy. However, compared with RNN-based methods such as VideoLSTM [46], LRCN [45] and MRNN [79], our method outperforms or roughly equals previous methods on the two datasets. Moreover, as shown in Table 6, our IP-LSTM model has far fewer parameters than other RNN-based architectures. In addition, the RNN architecture is good at dealing with temporal features, but it may lose spatial features, which are equally critical to action recognition. This may explain why the accuracy of RNN-based methods is lower than that of CNN-based methods.
We also notice that our IP-LSTM combined with IDT features achieves 91.4% and 68.2% accuracy on the UCF101 and HMDB51 datasets, respectively. Compared with other deep models fused with IDT features, our method outperforms VideoLSTM with IDT [46] by 5.2% and the CNN with IDT model [67] by 0.6% on the HMDB51 dataset. This implies that our deep learning features are highly complementary to hand-crafted features for action recognition.

V. CONCLUSION
In this study, we propose a pseudo recurrent residual neural network to learn long-term temporal features for improving performance in action recognition. The P-RRNNs architecture consists of two-stream pseudo recurrent residual neural networks that learn video features from the spatial stream and the temporal stream. We show that LSTM with pseudo residual learning significantly improves the performance of the network. In addition, P-RRNNs obtain more robust visual features by using the holistic video clip to depict human action, compared with using sampled frames. Experimental results show that our approach achieves promising performance on the UCF101 and HMDB51 datasets and obtains further improvements by fusing IDT features.
In future work, we will carry out additional studies on modelling deeper P-RRNNs to increase recognition accuracy. We will also study effective ways of fusing deep learning features and hand-crafted features.