Introduction
With the rapid advancement of computer technology and the widespread adoption of the internet, video data has grown markedly in both scale and complexity. Its richness and diversity have made video a crucial data type across industries, with extensive applications in fields such as video surveillance, sports analysis, and intelligent driving. In these applications, video understanding technology plays a pivotal role, enabling people to recognize, analyze, and comprehend behavioral actions from vast video datasets, as exemplified by the action recognition task in Fig. 1. To better exploit the potential of video data, effective recognition of the actions within videos is paramount. However, the complexity and scale of video data pose challenges to the research and application of video understanding technology, including issues of storage, processing, computation, analysis, and comprehension.
Catching or throwing a ball. The video encompasses a wealth of information, and through analysis of multiple video frames, it becomes evident that the individual within the frame is engaged in a pitching action.
In the past decade, the widespread application of convolutional neural networks (CNNs) has greatly improved performance across computer vision tasks. Since the introduction of AlexNet [1], the field has advanced rapidly, with CNNs playing a central role in image classification [1], [2], [3], [4], [5], object detection [6], [7], [8], semantic segmentation [9], [10], [11], and other domains, yielding remarkable results. In action recognition as well, the application of CNNs has steadily improved accuracy and contributed significantly to the advancement of this domain [12], [13], [14], [15], [16], [17], [18], [19]. Moreover, in recent years, action recognition work based on transformer models [21], [22], [23], [24] has flourished and garnered widespread attention from both academia and industry.
Temporal modeling is of paramount importance for video processing tasks. In convolutional neural networks (CNNs), most methods adopt 3D convolutions [16], [25], [26], [27] to strengthen modeling along the temporal dimension and make the network applicable to video. With the rise of transformer models in natural language processing (NLP) [20], however, their outstanding performance on text sequences has become evident. In computer vision, video data can likewise be viewed as a long sequence, which naturally extends transformer models to video understanding. Indeed, the number of transformer-based video understanding approaches has grown steadily in recent years, making this one of the hot research topics in the field.
Our approach utilizes a transformer-based network for video processing. However, compared to handling individual images, processing video data requires greater computational resources. To mitigate the computational complexity, we draw inspiration from seminal 2D CNN video processing work, particularly TSM [29]. We incorporate shift modules to simulate temporal interactions, enabling more effective information exchange and reducing the computational burden along the temporal dimension. To provide a more intuitive representation, we perceive a video as a sequence of T consecutive frames, where each frame can be denoted as $x_t \in \mathbb{R}^{H \times W \times C}$, with H, W, and C being the frame height, width, and number of channels.
Shift module, with the left image depicting the original non-shifted state, and the right image showing the state after shifting. In our transformer, we only apply shifts to the cls token to achieve information exchange between video frames from different time intervals.
Our action recognition processing model is based on the Vision Transformer [30], which is originally designed for 2D image classification. In combination with the shift operation, we simulate the interaction process of temporal information, allowing the video frames entering the network to move bidirectionally in the channel dimension. This shift operation serves as a zero-computation module and can be easily embedded as a plug-in into 2D transformers, enabling various 2D transformer networks to be applicable for video understanding tasks.
Please note that our shift module does not move all tokens in the ViT, as doing so would incur significant data movement costs. Instead, we choose to move only the cls token: under ViT's interpretation, the cls token can be regarded as carrying global information, so shifting it alone yields an approximately equivalent effect while saving considerable resources and time. Furthermore, although the shift module itself requires zero computation, channel shifting still incurs data-movement overhead. We therefore apply the shift operation only before the multi-head attention, similar to the use of shift modules in 2D CNNs [29]. This significantly alleviates the burden on the hardware.
The contributions of our research work can be summarized as follows:
Our proposed ViT-Shift action recognition model leverages the shift module to process videos, enabling the 2D Transformer network to handle video data more effectively. Moreover, this approach not only reduces computational complexity but also enhances the model’s fitting accuracy.
We carefully designed the utilization of the shift module, applying it only at critical positions, and experimental results have demonstrated the effectiveness of this approach. Compared to methods that apply shift operations at multiple positions, our strategy not only enhances the model’s accuracy but also reduces the hardware burden.
We solely utilize video RGB data as input and achieve impressive accuracy rates of 77.55% on the Kinetics-400 dataset and 93.07% on the UCF-101 dataset, with pretraining on ImageNet-21k. These results affirm the correctness of our approach, and we firmly believe that the shift module holds the potential for broad application and promotion in the field of computer vision, fostering further advancements in related research.
Related Work
A. Convolutional Neural Networks (CNN)
The Convolutional Neural Network (CNN) is a versatile and effective approach widely applied in computer vision, particularly for video recognition tasks [12], [14], [19], [32], [33]. Among these methods, the two-stream network stands as a classic deep learning architecture [12]. It comprises two distinct pathways, one processing RGB inputs and the other optical flow frames, which interact at the final stage of the network to produce the recognition result. This seminal work has inspired numerous subsequent studies [15], [31], [45]. Notably, Temporal Segment Networks (TSN) [15] introduced segment-based sampling and aggregation modules, enabling the network to efficiently learn temporal information from distant time points in videos and significantly advancing the development of two-stream networks. Furthermore, the BS-2SCN network [45] introduced Bidirectional Gated Recurrent Units (BiGRU) and a neuroscience-inspired attention mechanism (SimAM), enhancing the spatial stream's ability to extract features related to action appearance and improving the network's accuracy and stability. While two-stream networks can leverage 2D networks pre-trained on large-scale image datasets, handling optical flow data remains cumbersome.
With the advent of the C3D architecture [25], research on 3D convolutional networks for video recognition entered a new era. C3D, a neural network built on 3D convolutions, provided valuable insights for the subsequent development of 3D networks. The I3D model [16] brought the concept of inflation to the forefront: a 3D network generated by inflating a 2D image classification network can alleviate the limited-training-data problem that C3D faced by being unable to leverage ImageNet pretraining, thereby significantly enhancing video recognition accuracy. The SlowFast model [27] introduced a novel CNN architecture that combines two pathways, one specialized in temporal information and the other in spatial information; their fusion within the network achieved state-of-the-art performance. X3D [19] is among the strongest CNN models for video recognition; it is obtained through an iterative network search that explores the factors influencing video recognition, yielding optimal settings for spatio-temporal resolution, feature channels, and network depth and leading to superior performance in video recognition tasks.
B. Decomposed Convolutional Neural Networks
To reduce the complexity of network models for video recognition, many researchers have focused on exploring methods to reduce network size while maintaining similar recognition performance. P3D [28] proposed a method that decomposes the convolution used for processing spatio-temporal information into two separate convolutions: one dedicated to processing spatial domain information independently and the other specialized in handling temporal domain information. This approach significantly reduces the complexity of the network and improves its computational efficiency. S3D [33] introduced a novel convolution strategy, where convolutions are performed separately along the temporal and spatial dimensions, replacing the 3D convolutions in the I3D network. This strategy greatly reduces the model's parameter count and enhances network performance. R(2+1)D [32] proposed decomposing the 3D convolution kernel into independent spatial and temporal convolution kernels. This approach improves the network's recognition accuracy while reducing computational overhead; compared to I3D, it exhibits faster convergence and is easier to train. Additionally, Zhang et al. [49] introduced a residual network for optimization, decomposing the three-dimensional convolutional kernels into two types of kernels, spatial and temporal, and performed data flow fusion based on the two-stream network, effectively improving training rates. A minimal sketch of the shared factorization idea follows.
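To make the factorization concrete, the sketch below illustrates the general principle behind these decompositions in PyTorch: a full 3D convolution is replaced by a spatial convolution (1×k×k) followed by a temporal convolution (k×1×1). This is an illustrative sketch of the shared idea, not the exact block used in P3D, S3D, or R(2+1)D; the kernel sizes and the `mid_channels` width are assumptions.

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Minimal (2+1)D-style block: spatial conv followed by temporal conv."""
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or out_channels
        # 1 x k x k convolution handles the spatial dimensions only.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # k x 1 x 1 convolution handles the temporal dimension only.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):          # x: (batch, channels, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 8, 112, 112)      # a small illustrative video batch
y = FactorizedConv3d(3, 64)(x)          # -> (2, 64, 8, 112, 112)
```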
C. Next-Generation Video Understanding Models
The Transformer was originally proposed in the field of natural language processing for sequence processing tasks, showing promising performance in applications such as machine translation. Its self-attention mechanism enables effective handling of long-range dependencies. In recent years, the Transformer has found widespread application in natural language processing, computer vision, recommendation systems, and other domains. With the emergence of the Vision Transformer (ViT) in the visual domain [30], the Transformer has gradually been applied to computer vision and demonstrated higher accuracy than traditional CNNs. Consequently, an increasing number of works have been built upon the Transformer architecture in computer vision, such as CaiT [34], DeiT [35], and the Swin Transformer [36], among others. These works have significantly improved the accuracy of image classification. Similarly, in the field of video classification, the introduction of VTN [24], ViViT [21], TimeSformer [22], TokShift [23], and LAPS [39], among others, has further facilitated the application of Transformers in video recognition tasks. Among them, TimeSformer [22] adopts a spatiotemporal decomposed attention mechanism, which calculates self-attention separately for the spatial and temporal dimensions of the video. This approach effectively reduces training and inference costs, leading to outstanding performance in video recognition tasks. VTN [24] extracts features for each frame in a video and, through temporal processing components, efficiently captures spatiotemporal dependencies in video sequences. This enhances the model's efficiency and accuracy, showcasing robust capabilities in video recognition tasks.
The Network Model
Our model draws inspiration from the TSM network [29], which has demonstrated excellent performance in the 2D CNN domain, achieving high network accuracy at a relatively lower computational cost. The pivotal role played by the shift module in the TSM network has motivated us to incorporate the shift module into the ViT network. The overall architecture of our network model is illustrated in Fig. 3. Next, in Section III-A, we will delve into the shortcomings of current action recognition models and introduce the advantages of our proposed new model. In Section III-B, we will provide a detailed description of ViT, the first transformer-based network applied to computer vision. Section III-C will focus on describing the shift module, a component designed to reduce computational costs, and its application to video understanding tasks. Finally, in Section III-D, we will provide a comprehensive explanation of our new model, named ViT-Shift, which integrates ViT and the shift module and applies them to video action recognition tasks.
Overall architecture of our network model. In the design, we adhere to the principles of Vision Transformer while making efforts to retain its original structure and make necessary adjustments for video processing. After the tensor information is processed by Transformer (shift), it undergoes mean pooling before being utilized for prediction.
A. Problem Analysis
Currently, Transformer-based action recognition models have made significant progress in video understanding tasks. However, there are still some limitations. One crucial challenge is the modeling and interaction of temporal information. Traditional Transformer-based approaches typically use temporal attention mechanisms to handle temporal information in video sequences. However, this approach requires computing a large number of temporal attention weights, leading to high computational costs for the model.
To overcome these challenges and improve the performance of action recognition models, we propose an innovative Transformer-based action recognition model called ViT-Shift. Unlike previous methods, our model utilizes shift operations to simulate temporal interactions, avoiding the cumbersome computations of time-based attention. This shift operation efficiently captures temporal information in video sequences and significantly reduces computational complexity. In detail, we elaborate on the principles and applications of the shift operation in Section III-C. Additionally, the overall architecture of our model is depicted in Fig. 3, with further details available in Section III-D. By introducing shift operations, our model gains a better understanding and encoding of action patterns in videos, thereby enhancing accuracy.
Furthermore, compared to traditional convolution-based methods, our Transformer-based action recognition model exhibits significant advantages. Firstly, the Transformer model is capable of better modeling long-range dependencies, allowing the model to capture global contextual information in videos. Most importantly, our model demonstrates outstanding accuracy in action recognition tasks, fully validating its effectiveness and feasibility.
B. Vision Transformer (ViT) Overview
ViT [30] is regarded as one of the state-of-the-art models in the field of image classification. Its input is a single-frame image denoted as $x \in \mathbb{R}^{H \times W \times C}$, where H and W are the image height and width and C is the number of channels.
1) Embedding Layer Structure:
The primary function of the Embedding layer in ViT is to divide the input image into N non-overlapping patches of equal size, denoted $x_{0}^{1}, x_{0}^{2},\ldots,x_{0}^{N}$. Each patch is flattened and mapped by a linear projection E, a learnable class token $z_{cls}$ is prepended, and positional embeddings P are added:\begin{equation*} z=[z_{cls},Ex_{0}^{1},Ex_{0}^{2},\ldots,Ex_{0}^{N}]+P \tag{1}\end{equation*}
In the above expression, $z_{cls}$ denotes the learnable class (cls) token, E is the linear projection applied to each image patch $x_{0}^{i}$, and P represents the positional embeddings that retain the spatial position information of the patches.
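For concreteness, the following is a minimal PyTorch sketch of Equation (1), assuming a patch size of 16 and an embedding dimension of 768; the linear projection E is realized as a strided convolution, which is the common implementation choice, and all hyperparameter values here are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """z = [z_cls, E x_0^1, ..., E x_0^N] + P  (Eq. 1), as a rough sketch."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # E: linear projection of flattened patches, realized as a strided conv.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # z_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # P

    def forward(self, x):                                    # x: (B, 3, H, W)
        B = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)               # (B, 1, D)
        return torch.cat([cls, patches], dim=1) + self.pos_embed  # (B, N+1, D)
```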
2) Transformer Encoder:
After the Embedding module, the resulting N + 1 tokens (the N patch tokens plus the cls token) are passed through L stacked Encoder Blocks. Each Encoder Block primarily consists of Layer Norm, Multi-Head Attention, Dropout (or DropPath), and a final MLP Block. Layer Norm was originally proposed in NLP and, with the introduction of the Transformer, is applied to normalize each token. Multi-Head Attention comprises multiple attention heads, each computing its own attention distribution. The calculation for each head is as follows:\begin{equation*} Attention(Q_{i},K_{i},V_{i})=softmax\left({\frac {Q_{i}K_{i}^{T}}{\sqrt {d_{k}}}}\right)V_{i} \tag{2}\end{equation*}
After computing the attention of the individual heads, their outputs are concatenated and projected. The specific computation is as follows:\begin{align*} MultiHead(Q,K,V) &= Concat(h_{1}, h_{2}, \ldots, h_{m}) \cdot W^{o} \\ h_{i}& = Attention(Q \cdot W_{i}^{Q}, K \cdot W_{i}^{K}, V \cdot W_{i}^{V}) \tag{3}\end{align*}
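The sketch below implements Equations (2) and (3) directly for clarity. The number of heads and the embedding dimension are illustrative, and the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ are stacked into single linear layers, as is standard practice; this is a didactic sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention per head (Eq. 2), concatenated and projected (Eq. 3)."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)   # stacks W_i^Q for all heads
        self.w_k = nn.Linear(dim, dim)   # stacks W_i^K
        self.w_v = nn.Linear(dim, dim)   # stacks W_i^V
        self.w_o = nn.Linear(dim, dim)   # W^o

    def forward(self, z):                                   # z: (B, N+1, D)
        B, L, D = z.shape
        # Split each projection into heads: (B, heads, L, head_dim).
        q, k, v = (w(z).view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)  # Eq. 2
        heads = attn @ v                                     # (B, heads, L, head_dim)
        out = heads.transpose(1, 2).reshape(B, L, D)         # concat h_1 .. h_m
        return self.w_o(out)                                 # Eq. 3
```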
3) MLP Head:
The class token added at the Embedding layer aggregates global information. Throughout the L Transformer Encoder layers, the cls token participates in the attention computation as an independent token alongside the image-patch tokens. Through this interaction, the cls token integrates global image information and is finally used by the MLP Head to perform the classification task. The overall flow of ViT processing is as follows:\begin{align*} x^{l}&=DropPath(MSA(LN(z^{l})))+z^{l} \\ z^{l+1}&=DropPath(MLP(LN(x^{l})))+x^{l} \\ Class&=MLPHead(z^{L}) \tag{4}\end{align*}
C. Shift Module
The Shift module is designed to simulate the interaction of temporal information and reduce computational complexity. Before the Multi-Head Self-Attention (MSA) computation, temporal information extracted from different video frames is fused. Specifically, the module combines information from the frames at time steps t-1 and t+1 with the current frame at time t, avoiding the large computational overhead required by time-based attention. In the following, we use the symbol $c_{l}$ to denote the cls token at layer l; the shift operation is applied in place as\begin{equation*} c_{l}=shift(c_{l}) \tag{5}\end{equation*}
Our shift module adopts the idea of partially shifting a token's channels, inspired by previous work [29] that used partial in-place shift operations in 2D CNNs to improve accuracy. We therefore shift part of the channels of the current time step to interact with the preceding and succeeding frames, while the unshifted part preserves the original semantic information. The interaction process may cause some loss of semantic information in the shifted parts; to keep the dimensionality of the tokens unchanged, we apply zero-padding operations.
We observed that performing the shift operation on all tokens at every time step would incur substantial data-movement and memory overhead. We therefore restrict the shift to the cls token, which, as discussed below, aggregates global information and thus provides a comparable effect at a fraction of the cost.
Specifically, in the self-attention mechanism, each token at a given position calculates its relevance to other tokens in the sequence, with these relevances represented as weights. Since the cls token is positioned at the start of the sequence and does not correspond to a specific local region, it can interact with each token that retains information about a specific local area of the image. This allows the cls token to serve as a central hub for global information, comprehensively considering various parts of the image in the self-attention mechanism. Through the processing of the multi-head self-attention mechanism, the cls token eventually fuses feature information from the entire image and encodes global semantic information into its representation.
This integration of global information makes the cls token a focal point for the model to holistically consider key elements of the entire image context during the inference stage, providing more accurate predictions for classification tasks. Since the cls token has already incorporated global information, simulating interaction effects between different time frames can be effectively achieved by shifting only the cls token.
Accordingly, at time step t we partition the cls token $c_{l}^{t}$ along its channel dimension D into three segments, $c_{left}^{t}$, $c_{mid}^{t}$, and $c_{right}^{t}$, and define the shift as\begin{align*} c_{l}^{t}&=concat(c_{left}^{t},c_{mid}^{t},c_{right}^{t}) \\ c_{left}^{t}&=c_{left}^{t-1} \\ c_{mid}^{t}&=c_{mid}^{t} \\ c_{right}^{t}&=c_{right}^{t+1} \\ div&=\frac {c_{left}}{D}\left({or \frac {c_{right}}{D}}\right) \tag{6}\end{align*}
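A minimal sketch of the cls-token shift in Equation (6) is given below. It assumes the per-frame cls tokens of a clip are gathered into a tensor of shape (B, T, D), and that a shift ratio r corresponds to exchanging D // r channels in each direction; this ratio-to-channel mapping and the variable names are our own assumptions, while the zero padding at the clip boundaries follows the description above.

```python
import torch

def shift_cls_token(cls_tokens: torch.Tensor, shift_ratio: int = 4) -> torch.Tensor:
    """Temporal shift of the cls token only (Eq. 5/6).

    cls_tokens: (B, T, D) -- the cls token of every frame in a clip.
    Assumption: a ratio r exchanges D // r channels with each neighboring frame.
    """
    B, T, D = cls_tokens.shape
    fold = D // shift_ratio
    out = torch.zeros_like(cls_tokens)
    # c_left^t <- c_left^{t-1}: first `fold` channels come from the previous frame.
    out[:, 1:, :fold] = cls_tokens[:, :-1, :fold]            # frame 0 stays zero-padded
    # c_right^t <- c_right^{t+1}: last `fold` channels come from the next frame.
    out[:, :-1, -fold:] = cls_tokens[:, 1:, -fold:]          # last frame stays zero-padded
    # c_mid^t stays in place.
    out[:, :, fold:D - fold] = cls_tokens[:, :, fold:D - fold]
    return out
```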
In the previous section, we mentioned the properties of the shift operation, which is a zero-computation component that simulates the interaction of information and can save a significant amount of computation. However, too many shifts can lead to increased memory consumption and latency overhead, so caution is needed when using the shift module. The multi-head attention mechanism in ViT is similar to the convolution operation in neural networks, but it is more powerful. Therefore, we place the shift module only after the Layer Norm before the multi-head attention. Through experimental verification, we obtain satisfactory accuracy results, and detailed experimental results will be shown in Section IV. In addition, the setting of the shift ratio parameter is also our concern, and the appropriate shift ratio is crucial to the model performance. We will make comparisons in the ablation experiments to find more optimal shift ratio parameter settings.
D. ViT-Shift
The ViT structure is relatively straightforward, making it easy to train and adjust. Simultaneously, ViT possesses the capability to capture long-range dependencies in image sequences, exhibiting excellent generalization performance when handling large-scale data. This makes it a powerful tool for addressing complex visual tasks. The adaptability of ViT is particularly advantageous when applied to datasets for video action recognition.
In our work, we followed the design of the original ViT model and adapted it for the video understanding task. The main modifications were made to the input part of the model, and we added our shift component before the multi-head attention operation. The structure of our ViT-Shift network model is depicted in Fig. 4.
Model network architecture. The Shift Block represents the temporal shift module used for information interaction between different time frames. The encoder is stacked L times to process the data sequentially.
In the original ViT, the input is a single image, denoted as $x \in \mathbb{R}^{H \times W \times C}$. In our model, the input is instead a clip of T frames sampled from a video; each frame is divided into patches and embedded in the same way as in ViT, so that every frame yields its own sequence of tokens.
In our model, all processed tokens are input into the encoder. Before entering the encoder, we add position encodings to each token to retain the original spatial position information of the video frames. Additionally, we introduce a global information cls token at this stage. Finally, the transformed input, adapted to the transformer format, is fed into the encoder for further processing.
In the encoder part, we stack L encoder layers, and the dimensionality of all tokens is kept constant throughout the processing. By adjusting the value of L, the model can be manually scaled to fit different hardware devices according to the actual application requirements. In each encoder layer, we first normalize the tokens with LN (Layer Normalization) and then introduce our shift module, which simulates the interaction of temporal information so that information from different times is fused in the subsequent multi-head self-attention operation. To prevent overfitting and increase the robustness of the model, we also apply the DropPath mechanism (stochastic depth). Subsequently, all the tokens are again normalized by LN and then mapped through the MLP module, which consists of two linear layers and a GELU activation function.
After undergoing L rounds of encoding, our model produces the classification output through the MLP Head module. The detailed processing flow of our model for video recognition is:\begin{align*} c^{l}&=shift(c^{l}) \\ z^{l}&=DropPath(MSA(LN(z^{l})))+z^{l} \\ z^{l}&=DropPath(MLP(LN(z^{l})))+z^{l} \\ y&=\frac {1}{T}\sum _{t=1}^{T}MLPHead(c^{l,t}) \tag{7}\end{align*} where $c^{l,t}$ denotes the cls token of the t-th frame after the last encoder layer.
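The sketch below shows one ViT-Shift encoder layer and the averaging head following Equation (7). It assumes a token layout of (B*T, N+1, D) with frames of the same clip contiguous in the batch and the cls token at index 0, uses PyTorch's built-in multi-head attention in place of the custom sketch above, approximates DropPath with an identity for brevity, and assumes a shift ratio r maps to D // r channels per direction; it is an illustrative reconstruction rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ShiftEncoderBlock(nn.Module):
    """One ViT-Shift encoder layer following Eq. (7): shift the cls tokens, then MSA and MLP."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4, num_frames=8, shift_ratio=4):
        super().__init__()
        self.num_frames = num_frames
        self.fold = dim // shift_ratio          # assumed channels shifted per direction
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.drop_path = nn.Identity()          # stochastic depth would be applied here

    def shift_cls(self, z):                     # z: (B*T, N+1, D), cls token at index 0
        BT, _, D = z.shape
        cls = z[:, 0].view(-1, self.num_frames, D)                        # (B, T, D)
        shifted = torch.zeros_like(cls)
        shifted[:, 1:, :self.fold] = cls[:, :-1, :self.fold]              # c_left^t  <- t-1
        shifted[:, :-1, -self.fold:] = cls[:, 1:, -self.fold:]            # c_right^t <- t+1
        shifted[:, :, self.fold:D - self.fold] = cls[:, :, self.fold:D - self.fold]  # c_mid
        z = z.clone()
        z[:, 0] = shifted.reshape(BT, D)
        return z

    def forward(self, z):
        z = self.shift_cls(z)                                             # c^l = shift(c^l)
        h = self.norm1(z)
        z = z + self.drop_path(self.attn(h, h, h, need_weights=False)[0]) # MSA branch
        z = z + self.drop_path(self.mlp(self.norm2(z)))                   # MLP branch
        return z

def classify(z, head, num_frames):
    """y = (1/T) * sum_t MLPHead(c^{l,t}): average the per-frame cls predictions."""
    cls = z[:, 0].view(-1, num_frames, z.shape[-1])                       # (B, T, D)
    return head(cls).mean(dim=1)                                          # (B, num_classes)
```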
Experimental Evaluation
We evaluated our model on two standard datasets: the UCF-101 dataset [42] and the Kinetics-400 dataset [37]. As a baseline, we employed the original Vision Transformer architecture pretrained on ImageNet-21K. Our model is adaptable to various ViT sizes and exhibits excellent scalability. In Section IV-A, we provided a brief introduction to the datasets used. Next, in Section IV-B, we presented a detailed description of our model implementation. In Section IV-C, we conducted ablation studies to determine the optimal network hyperparameters. In Section IV-D, we present visualizations of our model’s outcomes to underscore its prowess in capturing pivotal action-related insights. Lastly, in Section IV-E, we compared our model with other state-of-the-art approaches to demonstrate its feasibility and effectiveness.
A. Dataset
Kinetics-400 dataset [37]. The Kinetics-400 dataset is a large-scale and high-quality dataset of YouTube video clips aimed at encompassing a wide range of human-centered actions. This dataset consists of 400 human action categories, each containing at least 400 video clips. Each video clip has a duration of approximately 10 seconds and is sourced from various YouTube videos. The Kinetics-400 dataset covers a diverse set of action categories, including interactions between humans and objects (e.g., playing instruments) and interactions between individuals (e.g., shaking hands), as well as other relevant categories. As one of the standard datasets in the field of action recognition, its scale and quality provide a crucial benchmark for research in this domain.
UCF-101 dataset [42]. The UCF-101 dataset is an action recognition dataset of realistic action videos collected from YouTube, containing 13,320 video samples from 101 action categories. The dataset exhibits great diversity in actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, background clutter, and lighting conditions, which makes it challenging. Since most earlier action recognition datasets are not very realistic and were staged by participants, the goal of UCF-101 is to facilitate further research in action recognition by learning and exploring new, realistic action categories. The action categories in this dataset can be grouped into five types: 1) human-object interactions; 2) actions involving only body motion; 3) human-human interactions; 4) playing musical instruments; and 5) physical activities.
We conducted experiments using the split-1 of the UCF-101 dataset and reported its performance metrics. Split-1 consists of 9,537 training videos and 3,783 validation videos. The details of the dataset information are provided in Table 1.
B. Implementation
Our model is developed based on the ViT architecture, making it applicable to various scales of ViT network models with excellent scalability. In our study, we chose to work with datasets comprising short videos lasting no more than 10 seconds, a common practice in the field. Short videos offer advantages in action recognition tasks as they are more adept at capturing essential information pertaining to specific actions, mitigating the presence of irrelevant noise compared to longer videos.
1) Training
In our experiments, we used the ViT base architecture pretrained on the ImageNet-21K dataset. The learning rate of our model was set to 0.0609. We performed 8-frame sampling with a sampling interval of 32. To enhance the model’s robustness, we adopted the same sampling preprocessing settings as in [23], including random resizing of video frames, random brightness adjustments, and horizontal flipping. For the Kinetics-400 dataset, we conducted training for 18 epochs, and for the UCF-101 dataset, we trained the model for 25 epochs. During training on the Kinetics-400 dataset, we reduced the learning rate by a factor of 10 at the 11th and 16th epochs, while for the UCF-101 dataset, we reduced the learning rate by a factor of 10 at the 10th and 20th epochs. Additionally, after conducting ablation studies on our model’s most critical component, the shift module, we set the shift ratio to 4. Our model was trained on a Linux operating system using two Tesla T4 GPUs with 16GB of memory each. The CPU powering the machine is an Intel Xeon Gold 6242R.
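For reproducibility, the step schedule described above can be expressed with PyTorch's MultiStepLR, as sketched below. The optimizer choice (SGD) and the placeholder model are assumptions, since the text only specifies the base learning rate, the decay epochs, and the decay factor.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(768, 400)                    # placeholder for the ViT-Shift model
optimizer = torch.optim.SGD(model.parameters(), lr=0.0609)   # base LR from Section IV-B

# Kinetics-400: 18 epochs, LR divided by 10 at epochs 11 and 16.
scheduler = MultiStepLR(optimizer, milestones=[11, 16], gamma=0.1)
# (For UCF-101: milestones=[10, 20] over 25 epochs.)

for epoch in range(18):
    # ... one training epoch over Kinetics-400 would run here ...
    scheduler.step()
```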
2) Testing
In our testing process, we followed the same evaluation strategy as in [23] to ensure a fair comparison with previous results. During testing, we sampled 10 video frames from each test video and resized each frame so that its shortest side matched the target size. We then cropped each frame from different directions to generate 3 sub-frames of size 224 × 224, which serve as the test views.
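A sketch of this multi-view evaluation (10 temporal samples × 3 spatial crops per video) is shown below. It assumes a model that maps a clip tensor to class logits, and averaging the per-view scores follows the common practice of the protocol in [23]; both the function name and the averaging step are illustrative assumptions.

```python
import torch

@torch.no_grad()
def evaluate_video(model, views):
    """Average predictions over all sampled views of one test video.

    views: list of view tensors for one video (e.g., 10 temporal samples x 3 crops),
           each shaped so that model(view.unsqueeze(0)) returns class logits.
    """
    scores = torch.stack([model(v.unsqueeze(0)).softmax(dim=-1) for v in views])
    return scores.mean(dim=0).argmax(dim=-1)   # final predicted class
```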
C. Ablation Experiments
To validate the actual impact of the Shift module on the final action classification, we conducted ablation studies comparing models with and without the Shift module. In addition, to optimize the performance of the Shift module, we studied the hyperparameter controlling the shift ratio, aiming to find the setting most suitable for our network.
1) Shift Module
The shift module is the component responsible for implementing the shift operation. It facilitates information interaction along the temporal dimension. Without the use of shift operations, multiple frames input into the network would be treated as independent, lacking effective interaction with other frames. This would result in the loss of the video’s dynamics and temporal context, potentially leading to a decline in the performance of video understanding. The shift operation addresses this issue, as described in Equation 6, by facilitating the interaction of partial information from the current frame with partial information from preceding and subsequent frames, while retaining essential information from the current frame. This enables the preservation of necessary spatial information while better integrating temporal information from different moments.
We conducted comparative experiments, comparing the ViT model with the added Shift module and the ViT model without the Shift module [23]. The results showed that the Shift module had a positive impact on video understanding tasks. On both datasets, the network model with the added Shift module outperformed the ViT model without the Shift module. The specific comparative results are shown in Fig. 5.
According to the data in Fig. 5, we observed that on the UCF-101 dataset, the addition of the Shift module resulted in a performance improvement of 1.61 percentage points compared to the ViT model without the Shift module. On the Kinetics-400 dataset, our model achieved a performance improvement of 1.53 percentage points compared to the ViT model without the Shift module. These results are presented based on TOP-1 accuracy.
2) Shift Ratio Setting
The shift ratio is a hyperparameter that controls the channel information interaction in the shift module. It determines the proportion of each token's feature vector that undergoes shifting. As expressed in Equation (6), only the $c_{left}$ and $c_{right}$ segments of the D channels are exchanged with the neighboring frames, so the shift ratio governs how large these segments are relative to the full channel dimension and therefore how much temporal information is exchanged.
To comprehensively compare the impact of different shift ratios on model performance and identify the optimal parameter settings, we conducted ablation experiments on the UCF-101 dataset. The specific experimental results are detailed in Figure 6.
Experimental results have demonstrated that setting the shift ratio to 4 is the optimal choice for action recognition tasks, as it helps to enhance the performance of the model.
D. Attention Visualization
The attentional effect of the network can be evaluated subjectively. In Fig. 7, we present the visualizations of our model’s application of attention mechanism on video clips. Specifically, through visualizing attention, we gain a better understanding of the reasoning process of our model in video comprehension tasks and verify its focus on crucial actions or objects.
Attention Visualization for Video Segments. The results indicate that our model pays particular attention to the relevant parts of the action subjects in the video during inference.
We utilized the visualization script provided in [30] to achieve the visualization functionality. Fig. 7 showcases video clips sampled from the Kinetics-400 dataset. In the first skiing clip, we can observe the model’s focus on the skier, a behavior that aligns closely with human attention. In the second dog-walking clip, our model attends to both the person and the dog, effectively ignoring the surrounding environment. This further substantiates our network’s ability to concentrate on the main subjects and perform accurate recognition.
E. Comparison to State-of-the-Art
1) Comparison on the Kinetics-400 Dataset
Our improved model not only enhances the performance of the ViT model in video recognition tasks but also demonstrates outstanding performance when compared to many state-of-the-art methods. Based on the results obtained from the ablation experiments, we compared the performance of our model with the state-of-the-art methods on the Kinetics-400 and UCF-101 datasets. These two datasets are standard datasets in the field of video recognition and have a high level of confidence. Specific comparison results for Kinetics-400 can be seen in Table 2.
Based on the comparison results presented, we observe that our model achieves comparable or even better accuracy than the state-of-the-art (SOTA) methods on the Kinetics-400 validation set. This finding validates the practical effectiveness of our approach, which combines the ViT architecture with the shift module to achieve significant advancements in video recognition tasks. It also validates our choice of introducing the shift module before the multi-head attention operation. Furthermore, our model performs well in terms of computational cost, as is evident from the metrics compared in Table 2, including "Frames/Clip" and "GFLOPs".
Specifically, compared to traditional general video recognition models (2D or 3D networks) such as AGPN [47] and GC-TSM [40], our model achieves a significant improvement in accuracy, indicating that the introduction of Transformers can accelerate the development of video recognition. It is noteworthy that, at the same input frame resolution (each video frame being a single image), compared with models such as I3D [16] and S3D-G [33], our model requires the fewest training video frames while performing excellently. Additionally, with the same number of input video frames as TSM [29] and TokShift [23], our model uses the lowest image resolution (224 × 224) while still achieving excellent performance.
By comparing the computational complexity and the number of inference views of our model with those of others, such as VTN [24], TimeSformer [22], and LAPS [39], we demonstrate its effectiveness in reducing computational load. Specifically, compared to TimeSformer, our computational load is reduced by 77.2%, with only a 0.93% decrease in accuracy. Similarly, in comparison to VTN, our computational load is reduced by 99.7%, accompanied by an accuracy drop of only 2.08%. Furthermore, the shift operation incurs zero computational cost and zero parameters: our model uses the same number of parameters as the ViT baseline in [23] and therefore fewer parameters than Transformer-based models such as VTN and TimeSformer. In summary, these comparisons with state-of-the-art methods robustly demonstrate the effectiveness of our work.
Furthermore, it is worth noting that the Tokshift [23] network, which shares similarities with our model structure, utilizes two shift modules, whereas our model incorporates only one shift module, yet still achieves favorable results. In TSM [29], it was highlighted that although shift modules are zero-computation and zero-parameter structures, the cost of data movement they introduce remains non-negligible. Therefore, relatively speaking, our network demonstrates higher efficiency in practical applications, as it requires fewer shift modules while maintaining competitive performance.
2) Comparison on the UCF-101 Dataset
UCF-101 is widely recognized as a classic standard dataset for action recognition tasks, and many research works evaluate the performance of network models on small-scale datasets. To verify the performance of our model on smaller-scale datasets, we compared it with other SOTA models on the UCF-101 dataset. The specific results are shown in Table 3.
According to the results in Table 3, the model employed in this study demonstrates outstanding performance on the UCF-101 dataset. In the field of deep learning, convolution-based methods like Two-stream [12], TSN [15], I3D [16], and P3D [28] have shown excellent performance in recent years. However, with the introduction of attention mechanisms, traditional convolutional methods have gradually lost their advantages. Additionally, the transformer-based shift model used in this study exhibits significant advantages compared to other networks employing attention mechanisms, surpassing the performance of the 3DResNet50-CS [41] model. Compared to TokShift [23], our model runs more smoothly and achieves higher accuracy. It is noteworthy that, relative to recent models such as VideoMAE [38], Kang et al. [48], BS-2SCN [45], HPER-Net [46], Conv Transformer [52], and SVFormer [53], our ViT-Shift model continues to perform well on small datasets like UCF-101. This emphasizes the robustness and adaptability of our model, especially in scenarios with limited data.
Conclusion
In this paper, we proposed a novel approach called ViT-Shift for action recognition, based on the Vision Transformer (ViT) model with the incorporation of a shift module. Compared to existing research, our model demonstrates significant improvements in both computational efficiency and accuracy for video action recognition. Notably, our model introduces a zero-parameter, zero-computation shift module solely before the multi-head attention of the encoder. As a result, compared to other action recognition models, our model requires less computation and achieves higher accuracy. We have validated its effectiveness on the Kinetics-400 and UCF-101 datasets, where it shows remarkable performance. Our future work will focus on further enhancing the accuracy of action recognition models while maintaining low computational overhead.