Human Action Recognition Algorithm Based on Multi-Feature Map Fusion

The emergence of convolutional neural networks has greatly improved the accuracy of human action recognition. However, as networks deepen, fewer and fewer features are retained, and in some datasets the size of the target to be recognized varies with the shooting angle. To address these problems, we propose an improved ResNeXt-based human action recognition method that uses multi-feature map fusion. First, each video is uniformly sampled to generate training samples, and samples with different numbers of frames are used as input to the network. Second, we add n up-sampling layers after layer 1 of ResNeXt to enlarge the feature maps and extract multiple feature maps, so that the extracted feature maps are clearer and small targets are better recognized. Finally, we fuse the n results with the weighted geometric means combination forecasting method based on the L_1 norm to obtain the final result. In experiments on UCF-101 and HMDB-51, our model reaches an accuracy of 90.3% on UCF-101, which is higher than most state-of-the-art algorithms.


I. INTRODUCTION
Because of its potential applications in video surveillance, behavior analysis, video retrieval, and other fields, human action recognition has become a very important topic in computer vision research [1]. Given a video sequence of human action, human action recognition detects, classifies, and tracks the moving targets, analyzes and recognizes the action, and describes it in natural language [2]. In the real world, human action recognition plays a basic role in video analysis.
Early human action recognition was based on hand-crafted features. These features depend heavily on the database: they perform well on some databases but are not necessarily applicable to others. Moreover, designing hand-crafted features takes a long time, which is not conducive to feature extraction on large databases.
With the rise of deep learning, automatic feature learning has overcome the shortcomings of hand-crafted feature engineering and brought significant progress to the field of human action recognition. However, because of the long time span of the information, the large amount of redundant information, complex backgrounds, and the diversity of viewpoints [3], many problems remain to be solved in human action recognition methods based on deep learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
In the context of big data, human action recognition has broader application scenarios, including video recommendation, monitoring analysis, and human-computer interaction. Although current algorithms have achieved good performance, they still need to be improved in accuracy and running speed. Human action recognition algorithms that preserve accuracy while improving running speed therefore remain a focus and a difficulty of current research.

II. RELATED WORK
A. FEATURE-BASED APPROACHES
Some classic image feature extraction methods have been generalized to video [4]. Traditional feature extraction methods for human action recognition include SIFT (Scale-Invariant Feature Transform), which relies on prior knowledge, HOG (Histogram of Oriented Gradients), SURF (Speeded-Up Robust Features) [5], which improves on SIFT, and 3D-SIFT [6]. The detection operators of traditional feature extraction algorithms are designed by hand and rely on a large amount of prior knowledge. Traditional algorithms are therefore time-consuming and labor-intensive. With the advent of deep learning, more and more studies have been influenced by the significant achievements of convolutional neural networks in static image recognition. In action recognition, deep models trained end to end have clearly surpassed hand-crafted features [8].

B. CNN BASED APPROACHES
Human action recognition methods based on convolutional neural network architectures are roughly divided into those using 2D convolutional neural networks and those using 3D convolutional neural networks [9]. 2D convolutional neural networks achieve good performance in static image recognition [10], [11]. It is easy to apply a 2D convolutional neural network to video to extract features, but doing so ignores the relationship between video frames. For this reason, Ji et al. [12] proposed a 3D convolutional neural network method. A 3D convolutional neural network consists of a 2D spatial dimension and a temporal dimension. 3D convolutional neural networks take temporal information into account but have too many parameters, so they are more difficult to train than 2D convolutional neural networks [13].
The 2D two-stream network architecture [14] divides the video into two parts: the temporal domain and the spatial domain. RGB images and optical flow frames are fed into the network to learn features in the spatial and temporal domains, respectively. However, the final prediction is obtained only by averaging the classification scores, and convolution between adjacent frames alone cannot capture long-term information. To address these problems, the 2D two-stream architecture was extended to a 3D two-stream architecture. I3D, proposed by Carreira and Zisserman [15], increases the length of the input video clip to capture longer-range information, but it is computationally intensive and cannot handle still longer videos. Based on I3D, Wang [16] proposed a non-local neural network, which exploits the spatiotemporal non-local relationships in the video. Xu et al. [17] and Qiu et al. [18] proposed coding methods to obtain a video-level representation but ignored the connection between frames. TSN, proposed by Wang et al. [19], adopts a sparse and global sampling method that samples a fixed number of frames to cover the temporal structure over a long time range, so that the entire length of the video does not have to be considered, and fusion is carried out at the end. Different from the above-mentioned spatio-temporal features, [20] encodes spatio-temporal features by imposing a weight-sharing constraint on the learnable parameters, so that temporal and spatial features can benefit from each other through collaborative learning. The above-mentioned spatio-temporal fusion methods are time-consuming and expensive to train. To solve this problem, Zhou et al. [21] proposed a new method that embeds the spatiotemporal fusion strategies into a pre-defined probability space, so that arbitrary fusion strategies can be evaluated at the network level without training them separately, which greatly improves the efficiency of analyzing spatiotemporal fusion strategies.
Inspired by FPN [22], we propose a multi-scale fusion method for human action recognition. By observing the datasets, we also found that the background of some actions is complex and that the targets to be recognized are small relative to the entire background, so we use up-sampling to enlarge the feature maps and make small targets clearer and easier to detect. Moreover, a deep network learns from the features it extracts: the more detailed and clear the extracted features, the better the learning effect. Therefore, on the basis of ResNeXt, we propose the Human Action Recognition Algorithm Based on Multi-feature Map Fusion, which adds n up-sampling layers after layer 1, with the resulting networks trained separately, to enlarge the feature maps and make the extracted features clearer. To address the problem that the 2D two-stream network architecture [14] obtains the final result only by averaging classification scores, we use the weighted geometric means combination forecasting method based on the L_1 norm to fuse the n results.

III. RESNEXT-101
We first introduce the architecture of ResNeXt. The traditional way to improve the accuracy of a model is to deepen or widen the network, but as the number of hyperparameters (such as the number of channels and the filter size) increases, the difficulty of network design and the computational overhead also increase. The ResNeXt structure was proposed to improve accuracy without increasing parameter complexity, while reducing the number of hyperparameters.
Xie et al. [23] proposed the ResNeXt network, which combines the layer-stacking idea of VGG with the split-transform-merge idea of Inception, yet remains highly extensible; it can be regarded as increasing accuracy without increasing, or even while reducing, the complexity of the model. ResNeXt introduces the notion of cardinality, defined as the size of the set of transformations. Experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider [23]. Fig. 1 shows a block of ResNeXt.
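The claim that cardinality can be raised without raising parameter complexity can be checked with a small weight count. The sketch below is illustrative only: it assumes the 2D bottleneck configurations commonly quoted for ResNeXt (256 input channels, width 64 with cardinality 1 versus width 128 with cardinality 32), ignores biases and batch normalization, and uses hypothetical helper names. The 3D variant used in this paper would have 27 kernel elements instead of 9.

```python
def conv_params(c_in, c_out, kernel_elems, groups=1):
    """Weights in a (grouped) convolution, ignoring biases and batch norm."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel_elems


def bottleneck_params(channels, width, cardinality, kernel_elems=9):
    """Weights in a 1x1 -> grouped 3x3 -> 1x1 bottleneck block."""
    reduce = conv_params(channels, width, 1)
    transform = conv_params(width, width, kernel_elems, groups=cardinality)
    expand = conv_params(width, channels, 1)
    return reduce + transform + expand


# Assumed configurations (not taken from this paper's Table 1).
resnet_block = bottleneck_params(256, 64, cardinality=1)
resnext_block = bottleneck_params(256, 128, cardinality=32)
print(resnet_block, resnext_block)
```

Under these assumptions the two blocks have nearly the same number of weights, even though the grouped block applies 32 parallel transformations.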
Table 1 shows the internal structure of ResNeXt-101, which is built from the ResNeXt block.
Shallow features contain detailed information, while deep features contain semantic information, which helps detect the target accurately. However, as the network [23] deepens, more useful features are filtered out. In addition, the size of the target to be recognized varies within the dataset. Therefore, we add n up-sampling layers after layer 1 of ResNeXt-101 to enlarge the extracted features, so that the network extracts more detailed features and small targets are better recognized.
Our work differs from previous approaches in two main aspects [24]: (1) based on ResNeXt-101, we add several up-sampling layers [25] to extract more feature maps; (2) the several groups of results obtained are fused using the weighted geometric means combination forecasting method based on the L_1 norm to get the final result.

IV. HUMAN ACTION RECOGNITION ALGORITHM BASED ON MULTI-FEATURE MAP FUSION
A. NETWORK ARCHITECTURE
The input image at the input layer of the neural network is convolved with convolution kernels to obtain feature maps. A feature map describes the characteristics of the image: the more numerous and detailed the extracted features, the better the recognition effect. Therefore, the more feature maps there are, the more representative the extracted features and the better the recognition. Shallow features carry more detailed information, while deep features carry more semantic information, which helps detect the target accurately. However, after multiple convolution layers, many features have been filtered out. On the basis of ResNeXt-101, we therefore propose a new architecture named Human Action Recognition Algorithm Based on Multi-feature Map Fusion, in which n up-sampling layers are added after layer 1 of ResNeXt-101 for prediction. Fig. 2 shows the architecture of our network. First, samples are generated by uniform sampling, and the default size of each sample is 3 channels × 16 frames × 112 pixels × 112 pixels. Second, we train the networks with stochastic gradient descent and obtain n prediction results. Finally, we calculate the weights of the training results with the fusion model of Section IV-C and fuse the test results according to these weights to get the final result.

B. UP-SAMPLING METHOD
On the basis of ResNeXt-101, we separately add n up-sampling layers to extract more features. To make this more concrete, we now discuss several up-sampling methods:

1) NEAREST NEIGHBOR INTERPOLATION
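Nearest neighbor interpolation is the simplest interpolation method. We obtain the coordinate (srcX, srcY) of the source image corresponding to (dstX, dstY) by (1) and fill in the corresponding pixel value.

$$\text{srcX} = \text{dstX} \cdot \frac{\text{srcWidth}}{\text{dstWidth}}, \qquad \text{srcY} = \text{dstY} \cdot \frac{\text{srcHeight}}{\text{dstHeight}} \tag{1}$$

Here (srcWidth, srcHeight) indicates the width and height of the source image, and (dstWidth, dstHeight) indicates the width and height of the image after interpolation.
VOLUME 8, 2020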
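As an illustration of this coordinate mapping, the following minimal sketch enlarges a small 2D array with nearest neighbor interpolation. It assumes that the scaled source coordinate is truncated to an integer index; the function name and the list-of-lists image representation are illustrative choices, not part of the original method.

```python
def nearest_neighbor_resize(src, dst_width, dst_height):
    """Resize a 2D image (list of rows) by copying the nearest source pixel."""
    src_height = len(src)
    src_width = len(src[0])
    dst = []
    for dst_y in range(dst_height):
        # srcY = dstY * srcHeight / dstHeight, truncated to an integer index.
        src_y = min(dst_y * src_height // dst_height, src_height - 1)
        row = []
        for dst_x in range(dst_width):
            # srcX = dstX * srcWidth / dstWidth, truncated to an integer index.
            src_x = min(dst_x * src_width // dst_width, src_width - 1)
            row.append(src[src_y][src_x])
        dst.append(row)
    return dst


image = [[1, 2],
         [3, 4]]
enlarged = nearest_neighbor_resize(image, 4, 4)
for row in enlarged:
    print(row)
```

Each source pixel is simply repeated in a 2 × 2 block, which is the origin of the blocky, jagged appearance of this method.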

2) BILINEAR INTERPOLATION
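This method calculates the pixel value of a point P(x, y) from the pixel values of the four points nearest to P. The core idea is to perform linear interpolation in two directions in turn. Let the pixel values f at $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, and $Q_{22} = (x_2, y_2)$ be known, and let $R_1 = (x, y_1)$ and $R_2 = (x, y_2)$. We first use (2) to calculate the pixel values of $R_1$ and $R_2$. Then, we calculate the pixel value of P using (3). Substituting (2) into (3) gives (4).

x direction:

$$f(R_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \qquad f(R_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{2}$$

y direction:

$$f(P) = \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \tag{3}$$

So the pixel value at point P is:

$$f(P) = \frac{f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1)}{(x_2 - x_1)(y_2 - y_1)} \tag{4}$$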
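The two-step procedure can be sketched as a short function. The coordinate convention is an assumption made for this example: Q11, Q21, Q12, Q22 are taken to lie at (x1, y1), (x2, y1), (x1, y2), (x2, y2), and the argument names are illustrative.

```python
def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Interpolate the value at (x, y) from four surrounding grid values."""
    # Step 1: linear interpolation in the x direction at y1 and y2.
    r1 = ((x2 - x) * q11 + (x - x1) * q21) / (x2 - x1)
    r2 = ((x2 - x) * q12 + (x - x1) * q22) / (x2 - x1)
    # Step 2: linear interpolation in the y direction between r1 and r2.
    return ((y2 - y) * r1 + (y - y1) * r2) / (y2 - y1)


# Value at the center of a unit cell with corner values 0, 10, 20, 30.
center = bilinear(0.5, 0.5, 0, 0, 1, 1, q11=0, q21=10, q12=20, q22=30)
print(center)
```

At the center of the cell the result is the average of the four corner values, and at a corner it reproduces that corner's value.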

3) TRILINEAR INTERPOLATION
Trilinear interpolation operates in a three-dimensional (D = 3) parameter space with polynomials of order n = 1, so it needs the 8 points adjacent to the point to be interpolated. On a periodic cubic lattice, let $x_d$, $y_d$, $z_d$ be the differences between each of x, y, z and the smaller related lattice coordinate, that is:

$$x_d = \frac{x - x_0}{x_1 - x_0}, \qquad y_d = \frac{y - y_0}{y_1 - y_0}, \qquad z_d = \frac{z - z_0}{z_1 - z_0} \tag{5}$$

where $x_0$ is the lattice point below x and $x_1$ is the lattice point above x; $y_0$, $y_1$, $z_0$, $z_1$ are defined in the same way. The pixel values $f_{000}$, $f_{001}$, $f_{010}$, $f_{011}$, $f_{100}$, $f_{101}$, $f_{110}$, $f_{111}$ are known, where $f_{ijk}$ denotes the value at $(x_i, y_j, z_k)$. First, we calculate the pixel values of $R_1$, $R_2$, $R_3$, $R_4$ using (6). Then we use (7) to calculate the pixel values of $r_1$, $r_2$. Finally, we calculate the pixel value f of the interpolated point using (8).

FIGURE 4. Trilinear interpolation.

First, interpolate in the x direction:

$$R_1 = f_{000}(1 - x_d) + f_{100} x_d, \quad R_2 = f_{001}(1 - x_d) + f_{101} x_d, \quad R_3 = f_{010}(1 - x_d) + f_{110} x_d, \quad R_4 = f_{011}(1 - x_d) + f_{111} x_d \tag{6}$$

Then interpolate in the y direction:

$$r_1 = R_1(1 - y_d) + R_3 y_d, \qquad r_2 = R_2(1 - y_d) + R_4 y_d \tag{7}$$

Finally, interpolate in the z direction:

$$f = r_1(1 - z_d) + r_2 z_d \tag{8}$$

Nearest neighbor interpolation causes discontinuities in the grayscale of the generated image. When the feature map is enlarged, this method directly copies the nearest pixel to generate each new pixel, so obvious jagged edges appear wherever the grayscale changes. Bilinear interpolation is more complicated and requires more computation, but because it combines four pixels it largely eliminates the jagged edges and does not suffer from grayscale discontinuity. However, bilinear interpolation acts as a low-pass filter, so high-frequency components are damaged and edges become blurred. Trilinear interpolation overcomes the shortcomings of the above two methods, with high calculation accuracy and a better effect. Therefore, we choose trilinear interpolation as the up-sampling method.
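The three interpolation stages can be sketched as follows. The assignment of R1 to R4 and r1, r2 to particular cube edges is an assumption made for this example (the original figure is not reproduced here), as is the indexing convention f[i][j][k] for the value at (x_i, y_j, z_k); any consistent assignment gives the same final value.

```python
def lerp(a, b, t):
    """Linear interpolation between a (t = 0) and b (t = 1)."""
    return a * (1 - t) + b * t


def trilinear(f, xd, yd, zd):
    """Interpolate inside a unit cell from its 8 corner values.

    f[i][j][k] is the value at corner (x_i, y_j, z_k), with i, j, k in {0, 1};
    xd, yd, zd are the normalized offsets from the lower corner, in [0, 1].
    """
    # Stage 1: interpolate along x on the four edges parallel to the x axis.
    R1 = lerp(f[0][0][0], f[1][0][0], xd)
    R2 = lerp(f[0][0][1], f[1][0][1], xd)
    R3 = lerp(f[0][1][0], f[1][1][0], xd)
    R4 = lerp(f[0][1][1], f[1][1][1], xd)
    # Stage 2: interpolate along y.
    r1 = lerp(R1, R3, yd)
    r2 = lerp(R2, R4, yd)
    # Stage 3: interpolate along z.
    return lerp(r1, r2, zd)


# Corner values sampled from the linear function x + 2y + 4z.
corners = [[[i + 2 * j + 4 * k for k in (0, 1)] for j in (0, 1)] for i in (0, 1)]
value = trilinear(corners, 0.5, 0.5, 0.5)
print(value)
```

Because trilinear interpolation is exact for functions that are linear in each coordinate, the value at the cell center matches 0.5 + 2(0.5) + 4(0.5) = 3.5. For video features, the three axes would correspond to time, height, and width.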

C. FUSE METHOD
We train these networks separately to obtain different training results. The weights of the training results are obtained by the method of [26]; when evaluating the network, these weights are used to fuse the individual results into the final result. The combined prediction model of [26] can be expressed as:

$$\min F(L) = \sum_{t=1}^{N} \left| \sum_{i=1}^{m} l_i e_{it} \right|, \qquad \text{s.t.} \;\; \sum_{i=1}^{m} l_i = 1, \;\; l_i \ge 0, \;\; i = 1, \dots, m$$

where $L = (l_1, \dots, l_m)$ is the weight vector of the m prediction methods, N is the number of time points, F(L) is the logarithmic error, based on the L_1 norm, between the weighted geometric mean combined prediction and the actual values of the index sequence, and $e_{it} = \ln x_t - \ln x_{it}$ is the logarithmic error between the predicted value $x_{it}$ and the actual value $x_t$ at time t for the i-th prediction method. The smaller F(L) is, the closer the weighted geometric mean combined prediction is to the actual values of the index sequence, and thus the more accurate and effective it is.
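To make the fusion step concrete, the sketch below combines positive predictions with a weighted geometric mean and measures the L_1 logarithmic error of the combination. It is a simplified illustration under stated assumptions, not the solver of [26]: the weights are assumed to be non-negative and to sum to 1, only two predictors are fused, and the weights are found by a coarse grid search. All function names are hypothetical.

```python
import math


def weighted_geometric_mean(predictions, weights):
    """Combine positive predictions x_i with weights l_i as prod(x_i ** l_i)."""
    return math.exp(sum(w * math.log(p) for p, w in zip(predictions, weights)))


def l1_log_error(weights, predictions_per_step, actual_per_step):
    """Sum over t of |ln(actual_t) - ln(combined prediction at t)|."""
    total = 0.0
    for preds, actual in zip(predictions_per_step, actual_per_step):
        combined = weighted_geometric_mean(preds, weights)
        total += abs(math.log(actual) - math.log(combined))
    return total


def fit_two_weights(predictions_per_step, actual_per_step, steps=100):
    """Grid search over (l, 1 - l); a simple stand-in for a proper solver."""
    best = None
    for k in range(steps + 1):
        l = k / steps
        weights = (l, 1.0 - l)
        err = l1_log_error(weights, predictions_per_step, actual_per_step)
        if best is None or err < best[0]:
            best = (err, weights)
    return best[1]


# Toy data: the first predictor matches the targets exactly.
scores = [(2.0, 4.0), (3.0, 1.0)]
targets = [2.0, 3.0]
weights = fit_two_weights(scores, targets)
print(weights, l1_log_error(weights, scores, targets))
```

In the toy data the first predictor is exact, so the search assigns it all of the weight. With class scores, the same combination would be applied per class, assuming strictly positive scores such as softmax outputs.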

V. EXPERIMENT
A. DATASET
UCF-101 is one of the databases with the largest number of action categories and samples, which contains 13320 videos and 101 categories. The database samples are taken from various sports samples collected from the BBC/ESPN and downloaded from the Internet. Fig. 5 shows several clips of UCF-101. HMDB-51 contains 6849 videos and 51 categories. Each category contains at least 101 videos. Most of the videos are from movies, and some are from public databases such as YouTube. Fig. 6 shows several clips of HMDB-51. We use split1 of UCF-101 and HMDB-51 for training and validation. When testing, the dataset is the same as the validation set.

B. IMPLEMENTATION
In the experiments, we take n = 1 and n = 2, that is, we add one and two up-sampling layers, respectively.
Training: We use SGD with momentum to train the networks. To augment the data, we randomly generate samples from the videos in the training data. Fig. 7 shows the method of generating training samples. First, we choose a temporal location in the video by uniform sampling and produce a 16-frame clip around it. If the video has fewer than 16 frames, we loop it as many times as necessary. Then, we randomly pick a spatial location from the four corners or the center and select a spatial scale for multi-scale cropping.
The aspect ratio of the sample is 1, and the sample is cropped spatio-temporally. The final size of each sample is 3 channels × 16 frames × 112 pixels × 112 pixels. All the resulting samples retain the class labels of their original videos.
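The sample generation steps above can be sketched in a few lines. This is a minimal illustration under assumptions: the clip is taken as 16 consecutive frames starting at a uniformly chosen position, short videos are looped with a modulo, and the crop is a square whose side is a fraction (`scale`) of the shorter image side. The position names and helper functions are hypothetical, and the resize of the crop to 112 × 112 pixels is omitted.

```python
import random


def clip_frame_indices(num_frames, start, clip_len=16):
    """Indices of clip_len consecutive frames from start, looping short videos."""
    return [(start + i) % num_frames for i in range(clip_len)]


def sample_clip(num_frames, clip_len=16, rng=random):
    """Pick a uniformly random temporal location and return its frame indices."""
    start = rng.randrange(max(1, num_frames - clip_len + 1))
    return clip_frame_indices(num_frames, start, clip_len)


def crop_box(width, height, scale, position):
    """Square crop (aspect ratio 1) at one of four corners or the center.

    Returns (left, top, right, bottom) in pixel coordinates.
    """
    size = int(min(width, height) * scale)
    if position == "center":
        left, top = (width - size) // 2, (height - size) // 2
    else:
        left = 0 if "left" in position else width - size
        top = 0 if "top" in position else height - size
    return left, top, left + size, top + size


short_video = clip_frame_indices(num_frames=10, start=0)
print(short_video)
print(sample_clip(num_frames=100))
print(crop_box(320, 240, 0.5, "bottom_right"))
```

For a 10-frame video the indices wrap around after frame 9, so the clip still contains 16 frames.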
In training, we initialize the parameters of the network with ResNeXt-101 pre-trained on Kinetics and fine-tune the last two layers using the SGD optimizer with momentum 0.9. We start with a learning rate of 0.05 and divide it by 10 when the validation loss saturates. To prevent overfitting, we also add dropout with a probability of 0.5. The loss function we use is the cross-entropy loss function,

$$L = -\sum_{c} y_c \log \hat{y}_c$$

where $\hat{y}$ represents the predicted value and $y$ represents the actual value.
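The loss and the learning rate schedule can be illustrated with plain Python. This sketch makes two assumptions that the text does not specify: the network outputs raw logits that are passed through a softmax, and "saturates" is interpreted as no improvement of the validation loss for `patience` consecutive epochs (the default of 10 is a hypothetical value). The class name is illustrative.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against a one-hot label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


class PlateauSchedule:
    """Divide the learning rate by 10 when the validation loss stops improving."""

    def __init__(self, lr=0.05, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


schedule = PlateauSchedule(lr=0.05, patience=2)
for loss in [1.0, 0.9, 0.95, 0.95]:
    current_lr = schedule.step(loss)
print(current_lr, cross_entropy([0.0, 0.0], 0))
```

In the usage example the validation loss fails to improve for two epochs, so the learning rate drops from 0.05 to 0.005.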
Recognition: We use a sliding window to produce the input clips. We then feed each clip into the network and evaluate its class scores; the class scores of a video are the average over all of its clips. Using the up-sampling method described above, we generate two feature maps with different scales for training and recognition and obtain different class scores. Finally, we fuse the different class scores through [26] to get the final class score.
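The recognition procedure can be sketched as follows. The stride of the sliding window is not stated in the text, so non-overlapping 16-frame windows are assumed here; the function names are illustrative, and the clip scores are made-up numbers standing in for network outputs.

```python
def sliding_windows(num_frames, clip_len=16, stride=16):
    """Start indices of the clips produced by a sliding window."""
    if num_frames <= clip_len:
        return [0]
    return list(range(0, num_frames - clip_len + 1, stride))


def video_scores(clip_scores):
    """Average the per-class scores over all clips of a video."""
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(clip[c] for clip in clip_scores) / num_clips
            for c in range(num_classes)]


def predict(scores):
    """Index of the class with the highest score."""
    return max(range(len(scores)), key=lambda c: scores[c])


starts = sliding_windows(num_frames=40)
# Hypothetical class scores for two clips and two classes.
clip_scores = [[0.2, 0.8], [0.6, 0.4]]
averaged = video_scores(clip_scores)
print(starts, averaged, predict(averaged))
```

Clip-level accuracy compares `predict` on each clip's scores with the label, while video-level accuracy compares `predict` on the averaged scores.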

C. FEATURE MAPS
We experiment with the three up-sampling methods separately. Fig. 8 shows the feature maps obtained with the different up-sampling methods.
We enlarge the same feature map from each of the three sets, as shown in Fig. 9. The feature map obtained by nearest neighbor interpolation is fuzzy and has obvious jagged edges. The feature map obtained by bilinear interpolation has no obvious jagged edges, but its edges are blurred. The feature map obtained by trilinear interpolation has no obvious jagged edges, and its edges are clearer. These results support the choice made in Section IV-B.

D. RESULTS
We separately sample 16-frame clips and 32-frame clips for training. Figs. 10 and 11 show the training and validation losses for the different sample durations. As can be seen in the figures, the validation losses are only slightly higher than the training losses on UCF-101, which indicates that the network performs well on UCF-101. On HMDB-51, however, the validation losses converge quickly and remain higher than the training losses, which indicates that the network does not perform as well on HMDB-51 as on UCF-101. To evaluate the network, we measure clip and video accuracies. For each clip, the class with the maximum score is taken as the clip-level prediction; the video-level score is obtained by averaging the clip scores. Table 2 shows the accuracies for the different sample durations. As the sample duration increases, both the clip-level accuracy and the video-level accuracy improve.
We then compare our architecture with state-of-the-art algorithms; the results are presented in Table 3. The proposed architecture achieves higher accuracies than the other state-of-the-art algorithms.
Moreover, on UCF-101 our architecture improves by 4.4% over iDT [4], the most effective method with hand-crafted features, and by 2.3% over the two-stream network [14], which uses deeper features. Fig. 12 shows the confusion matrices for the different sample durations on UCF-101 and HMDB-51, respectively.
In UCF-101, most of the classes perform well; some classes, such as ApplyEyeMakeUp, ApplyLipstick, Archery, BabyCrawling, and Rowing, reach accuracies of 99.8%, 99.7%, 99.9%, 99.9%, and 99.8%, respectively. However, as shown in Fig. 12 (a) and (b), the most confused classes are: Shotput with VolleyballSpiking (39.1%), Skiing with Surfing (35.0%), ShavingBeard with ApplyEyeMakeUp (32.6%), RockClimbingIndoor with RopeClimbing (32.7%), and PlayingFlute with PlayingViolin (29.2%). Fig. 13 shows the most confused classes of UCF-101. Shotput and VolleyballSpiking involve similar movements and have many people in the scenes. Skiing and Surfing are the same type of sport and have the same kind of background. ShavingBeard and ApplyEyeMakeUp take place in similar rooms of a house and involve similar movements. RockClimbingIndoor and RopeClimbing are the same type of sport. PlayingFlute and PlayingViolin both involve playing musical instruments, with the players in similar positions and moving their hands.
In HMDB-51, the classes do not perform as well as those in UCF-101. As shown in Fig. 12 (c) and (d), the most confused classes are: ride_horse with ride_bike (56.6%), smile with laugh (43.3%), talk with chew (36.7%), cartwheel with flic_flac (30.0%), and walk with run (26.7%). Fig. 14 shows the most confused classes of HMDB-51. Ride_horse and ride_bike both involve riding movements and similar scenes. Smile and laugh are similar facial expressions. Talk and chew have similar mouth movements. Cartwheel and flic_flac are two similar actions. Walk and run have similar leg movements. In addition, some classes are confused with several other classes: fall_floor is confused with kick_ball, jump, punch, and stand, and shoot_bow is confused with shoot_gun, jump, and punch.

VI. CONCLUSION
At present, human action recognition is a focus and a difficulty of research and has very wide application prospects, mainly in monitoring, human-computer interaction, and other scenarios. Because of the complexity and diversity of human action, research on human action recognition faces great challenges. This paper presented a solution to improve the performance of human action recognition. We proposed an architecture based on multi-feature map fusion, which uses multiple up-sampling layers to enlarge the feature maps, so that smaller targets are better detected and the features extracted by the network are richer and clearer. For the up-sampling method in our architecture, nearest neighbor interpolation, bilinear interpolation, and trilinear interpolation were studied. Experiments show that the feature maps obtained by trilinear interpolation are better than those of the other two methods. We also used clips with different sample durations for training. The results indicate that the accuracies improve as the sample duration increases. Finally, to fuse the results obtained by the networks, we did not average the scores as in [14]; instead, we used the weighted geometric means combination forecasting method based on the L_1 norm to fuse the n results. The proposed architecture achieved 90.3% and 58.4% on UCF-101 and HMDB-51, respectively, which illustrates that the architecture is effective and comparable.