I. Introduction
The rapid development of imaging equipment has led to the generation of massive amounts of video, creating a need to analyze human actions in videos for tasks such as search, ranking, and intelligent recommendation. Action recognition methods fall broadly into two categories: hand-crafted feature methods and deep learning methods. Over the last decade, deep learning methods have been widely used in human action recognition because they can automatically extract spatiotemporal features from image sequences and significantly improve recognition accuracy compared with traditional methods. Commonly used deep network architectures include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Convolutional Networks (GCNs). Among these, CNNs are widely used in action recognition because they can extract features directly from images.