Multi-Stream Single Network: Efficient Compressed Video Action Recognition With a Single Multi-Input Multi-Output Network

Compressed video action recognition classifies actions using multiple features stored in compressed videos to omit the decoding process for RGB frames and shorten the computation time. Previous methods mostly used multiple networks to process compressed video features and explored the use of lightweight networks without affecting accuracy to reduce the computational complexity further. We have focused on another approach that uses only one network to reduce computational complexity. Our previous study proposed the MussNet model, which consists of independent subnetworks within a single network instead of multiple networks. The subnetworks classify compressed video features independently with a feedforwarding step of a single network and achieved competitive accuracy against previous studies with lower computational complexity. The remaining issue of the MussNet model is how to fuse the independently processed compressed video features. The current MussNet model makes independent predictions from each input and only averages them to fuse the inputs. However, recent studies have shown that intermediate fusion, which fuses features inside the networks, improves accuracy. This study proposes the EFS module that extends the MussNet model into intermediate fusion by disentangling and aggregating the features of the same videos in the hidden vectors while keeping the individual subnetworks. Our experiments show that the EFS module improves the MussNet model’s accuracy by 0.4 points for UCF-101 and 1.0 points for HMDB-51, while the additional GFLOPs are only 1% of those of the MussNet model. These accuracy scores are also competitive against previous studies while maintaining one of the lowest computational complexities.


I. INTRODUCTION
Video understanding is a research field that leverages machine learning to analyze video data. This technology has a wide range of applications, including automated driving, surveillance systems, and video generation. A typical task in video understanding is action recognition, which involves classifying human actions from short video clips. In traditional studies, handcrafted descriptors like dense trajectories [1] and improved dense trajectories [2] were developed to classify actions. Recent studies have used deep learning models to classify human actions directly from RGB frame sequences [3], [4], [5]. Nevertheless, accessing RGB frames requires decoding because most videos are compressed for efficient storage. While some studies have even incorporated optical flow as an auxiliary input to boost accuracy [6], [7], [8], [9], [10], [11], it also has the same limitation because it is computed from RGB frame sequences. This decoding limits the deployment of action recognition models on mobile or edge devices.
Videos frequently contain redundant information, such as backgrounds, and video compression reduces such redundancy by converting RGB frames into different features, including I-frames, motion vectors, and residuals. The I-frames are stored as images, whereas the motion vectors and residuals represent only the changes from previous RGB frames. Compressed video action recognition directly classifies actions from compressed video features, as depicted in Fig. 1. We can omit the decoding process and reduce the computational cost by using compressed video features as inputs. Wu et al. [12] showed that this method achieves competitive classification accuracy against conventional RGB-based action recognition but with lower computational complexity defined by floating-point operations (FLOPs). Some studies have leveraged the low computational complexity of compressed action recognition to deploy action recognition models on mobile and edge devices [13], [14].
The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
Most conventional methods for compressed video action recognition use multiple networks to process I-frames, motion vectors, and residuals of compressed video features. Some studies optimized these networks independently and fused their predictions for inference [12], [14], [15], [16], [17]. Others jointly optimized them by weakly connecting their hidden layers for performance improvement [13], [18]. Such models require the computational complexity of multiple networks.
The computational complexity of multiple networks may be unnecessary because most parameters in deep networks are unused and removable after training, as shown in various studies [19], [20], [21]. For the efficient ensemble of image classification, the multi-input multi-output (MIMO) model utilized the unused parameters by creating independent subnetworks within a single network [22]. The subnetworks process multiple images independently with one feedforwarding step of the parent network and make different predictions from the images. As a result, the MIMO model achieved competitive accuracy against multiple networks while reducing the computational complexity.
Inspired by the MIMO model, we have proposed a multi-stream single network (MussNet) for efficient compressed video action recognition in our previous study [23]. This model trains a single network in the MIMO manner and creates three independent subnetworks within a single network. The subnetworks in the MussNet model independently process I-frames, motion vectors, and residuals instead of the multiple networks used in the previous methods. The proposed model achieved competitive accuracy against multiple networks while reducing the overall computational complexity.
The limitation of the current MussNet model is that this model only processes inputs independently and cannot fuse the input features inside the network. Recent studies have shown that intermediate fusion, which fuses hidden vectors of multiple networks at some intermediate points of the feedforwarding step, improves the accuracy of multi-stream models [5], [10], [11]. However, in the MussNet model, features of I-frames, motion vectors, and residuals are contained in the shared hidden vectors, and how to perform the intermediate fusion using the subnetworks is non-trivial.
To overcome this limitation, we expand our previous study by proposing a novel module named the Extract, Fuse, and Scale (EFS) module. The EFS module learns to disentangle and fuse the features of I-frames, motion vectors, and residuals from the shared hidden vectors, allowing the MussNet model to perform intermediate fusion. In addition, this module is designed to add only a small amount of computation to the feedforwarding step. The MussNet model with the EFS module improves accuracy over the original MussNet model while maintaining the efficiency of a single network in terms of computational complexity. The contributions of this study can be summarized as follows:
1) We propose the EFS module to improve the accuracy of the MussNet model with little additional computational complexity.
2) We experimentally show that the MussNet model with the EFS module achieves competitive accuracy against our multiple network-based baselines while reducing the computational complexity.
3) We analyze the EFS module and clarify that extending the MussNet model into intermediate fusion improves accuracy. We also show that the EFS module succeeds in disentangling and fusing features from the shared hidden vectors with only 1% additional GFLOPs relative to the original MussNet model.

II. RELATED WORK
A. COMPRESSED VIDEO ACTION RECOGNITION
Pioneering studies [24], [25] on compressed video action recognition only used motion vectors as an easy-to-use alternative to optical flow, which is expensive to compute. These studies still used decoded RGB frames and did not use compressed video features other than motion vectors. Wu et al. [12] first proposed the CoViAR method that classifies videos using only compressed video features. They employed three 2DCNNs corresponding to compressed video features and trained them independently. After training, the final prediction was computed by averaging the predictions of the three different networks. Li et al. [26] showed that compressed video action recognition was available under the practical scenario that compressed videos are transmitted from other devices and some packets are dropped. Some studies focused on the efficiency of compressed video action recognition and extended it into different tasks, such as real-time object tracking [27] and facial expression recognition [28].
Subsequent studies have developed more efficient or effective compressed video action recognition methods by replacing backbone networks with different lightweight networks and employing additional components to maintain the accuracy of the CoViAR method. For example, CV-C3D [17] and MFCD-Net [29] used 3DCNNs; Wu et al. [15] and Guo et al. [30] used ResNet18 [31] as their backbone network and trained it using knowledge distillation [32]; TTP [14] combined MobileNetV2 [33] with an efficient yet effective fusion method. Other studies estimated optical flow from motion vectors and residuals and used the estimated optical flow to improve accuracy. The DMC-Net method [16] trained the optical flow estimator in a supervised manner using actual optical flow for training. The subsequent SIFP method [18] developed an unsupervised approach to train the optical flow estimator without actual optical flow. The proposed MussNet provides another approach for efficient compressed video action recognition.

B. EFFICIENT ENSEMBLE
Our method is inspired by the recent ensemble method that can estimate the uncertainty of predictions or improve out-of-distribution robustness by feeding the same inputs (e.g., images) into multiple networks and fusing their predictions. The problem with the ensemble methods is the expensive computation and memory costs for training and testing multiple networks. Various approaches, such as Monte Carlo Dropout [34] and Snapshot [35], were proposed to address this problem. Havasi et al. proposed the MIMO method [22], which uses a single MIMO network instead of multiple single-input single-output networks. Unlike other methods, their method processes multiple inputs with one feedforwarding step of a single network. They showed that independent subnetworks are obtained in a single network through MIMO learning. We extend the MIMO method into compressed video action recognition to obtain a single network that processes compressed video features simultaneously.

III. METHOD
A. BACKGROUND
1) COMPRESSED VIDEO FEATURES
This study used the MPEG-4 Part 2 format [36] to compress videos, following the previous study [12]. This format arranges RGB frames into groups of pictures (GOPs), where each GOP contains a fixed number of frames and starts with an I-frame, followed by several P-frames. The codec stores I-frames as standard RGB images and P-frames as the changes in the RGB values from the previous frames. Specifically, the differences between P-frames and the previous frames are represented by motion vectors and residuals. The motion vectors represent the coarse motion of 8 × 8 macroblocks from previous to current frames. The residuals represent pixel-wise differences in the RGB values between previous frames transformed by the motion vectors and current frames. Compressed video action recognition classifies videos from the I-frames, motion vectors, and residuals.
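The relation between a P-frame, its motion vectors, and its residual can be illustrated with a toy example. The sketch below is a one-dimensional simplification and not the MPEG-4 Part 2 codec: the function names, the block size, and the sign convention (residual = current frame minus motion-compensated previous frame) are assumptions made only for illustration.

```python
def motion_compensate(prev, motion, block):
    """Predict the current frame by reading each block of `prev` from a shifted position.

    `motion[b]` is the displacement (in pixels) of block b; reads outside the
    frame are clamped to the border. This is a 1-D stand-in for 2-D macroblocks.
    """
    n = len(prev)
    pred = []
    for x in range(n):
        src = x - motion[x // block]
        pred.append(prev[min(max(src, 0), n - 1)])
    return pred


def residual(prev, cur, motion, block):
    """Pixel-wise difference between the current frame and its prediction."""
    pred = motion_compensate(prev, motion, block)
    return [c - p for c, p in zip(cur, pred)]


def reconstruct(prev, motion, res, block):
    """Recover the current frame from the previous frame, motion, and residual."""
    pred = motion_compensate(prev, motion, block)
    return [p + r for p, r in zip(pred, res)]


prev_frame = [0, 1, 2, 3, 4, 5, 6, 7]
cur_frame = [0, 0, 1, 2, 9, 5, 6, 7]
motion_vectors = [1, 0]  # first block moved right by one pixel, second block is static
res = residual(prev_frame, cur_frame, motion_vectors, block=4)
restored = reconstruct(prev_frame, motion_vectors, res, block=4)
```

In this toy case the motion vector explains the shift of the first block, so the residual is non-zero only at the single pixel that actually changed.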

2) NAÏVE FUSION METHODS FOR COMPRESSED VIDEOS
To utilize multiple inputs, including compressed videos, it is crucial to effectively fuse their information to obtain accurate predictions. The design of the fusion method has been extensively studied as it significantly affects classification performance. This study focused on three naïve fusion methods: early, late, and intermediate fusion.
Early fusion, depicted in Fig. 2-(a), is a straightforward method of fusing compressed video features. This method concatenates compressed video features in advance and feeds the concatenated features into a single network. The advantage of the early fusion method is that it only requires one network to process compressed video features, leading to lower computational costs than other fusion methods. However, early fusion often yields poorer classification performance than other methods, which limits its use in action recognition.
Late fusion, depicted in Fig. 2-(b), is another simple fusion method for compressed videos, which has been used in many previous methods [12], [14], [16], [17], [26]. In compressed video action recognition, the late fusion method uses three networks and independently classifies actions from each compressed video feature using these networks. The final prediction is obtained by averaging the three predictions. Late fusion can significantly improve classification performance compared to early fusion. However, late fusion only linearly fuses the predictions from compressed videos and cannot nonlinearly fuse compressed video information, leaving room for further improvement in accuracy.
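As a minimal sketch of the late fusion step described above, the following code averages class probabilities from three streams. The probability values and variable names are hypothetical and serve only to show that the fusion is a plain linear average.

```python
def late_fusion(predictions):
    """Average per-stream class probabilities (one list of probabilities per stream)."""
    num_streams = len(predictions)
    num_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / num_streams for c in range(num_classes)]


# Hypothetical outputs of the I-frame, motion-vector, and residual networks.
p_iframe = [0.7, 0.2, 0.1]
p_motion = [0.4, 0.5, 0.1]
p_residual = [0.5, 0.3, 0.2]

fused = late_fusion([p_iframe, p_motion, p_residual])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the average is linear in each stream's output, no interaction between the streams can be learned at this stage, which is the limitation noted in the text.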
Intermediate fusion, depicted in Fig. 2-(c), is a more complex fusion method that fuses input feature information during hidden layer processing. This method also uses three networks and trains them to classify each compressed video feature, similar to the late fusion method. Furthermore, intermediate fusion aggregates the hidden layer outputs in networks and passes their information to one of the networks, enabling the nonlinear fusion of compressed video features. Among conventional methods, SIFP [18], MEACI-Net [37], and He et al.'s method [13] employed intermediate fusion to improve accuracy.

B. MULTI-STREAM SINGLE NETWORK
We aim to simultaneously process I-frames, motion vectors, and residuals using a single network for efficient compressed video action recognition. However, early fusion, which trains a single network to classify actions from the concatenated I-frames, motion vectors, and residuals, leads to poor accuracy as described in Sec. III-A2.
To overcome this problem, we proposed the MussNet model. This model was inspired by the MIMO model, originally proposed for the efficient ensemble [22]. The original MIMO model performs an ensemble of predictions from inputs with the same feature type. Instead, we perform an ensemble of predictions from different features: I-frames, motion vectors, and residuals. As depicted in Fig. 3, our model is a single network with three prediction heads corresponding to I-frames, motion vectors, and residuals, respectively. The prediction heads are trained to classify videos only from corresponding features by simultaneously feeding I-frames, motion vectors, and residuals extracted from different videos into MussNet. This training encourages MussNet to have independent subnetworks that classify videos from one of the compressed video features using the corresponding prediction head. The created subnetworks are available for late or intermediate fusion.

1) TRAINING
Let $D = \{(x_i, t_i)\}_{i=1}^{N}$ be a set of $N$ labeled videos for training, where $x_i = \{x^I_i, x^M_i, x^R_i\}$ represents the I-frames, motion vectors, and residuals of the $i$-th video and $t_i$ is the corresponding label. During training, the MussNet model receives I-frames, motion vectors, and residuals from different videos; i.e., $\{x^I_i, x^M_j, x^R_k\}$ with $i \neq j \neq k$ are given as inputs. Note that a set of videos $\{x_i, x_j, x_k\}$ is randomly chosen from the dataset $D$ at every optimization step. Then, the MussNet model returns three probability distributions $p_\theta(y^I \mid x^I_i, x^M_j, x^R_k)$, $p_\theta(y^M \mid x^I_i, x^M_j, x^R_k)$, and $p_\theta(y^R \mid x^I_i, x^M_j, x^R_k)$ using three prediction heads. The network parameters $\theta$ are optimized to minimize the following loss:
$$L(\theta) = -\log p_\theta(y^I = t_i \mid x^I_i, x^M_j, x^R_k) - \log p_\theta(y^M = t_j \mid x^I_i, x^M_j, x^R_k) - \log p_\theta(y^R = t_k \mid x^I_i, x^M_j, x^R_k)$$
This optimization promotes the independence of model predictions because each compressed video feature does not provide significant information for predicting the labels of the other two features, as the classes of input videos may differ. For example, $x^I_i$ contains only information about the corresponding label $t_i$, while $t_j$ and $t_k$ are not predictable from $x^I_i$. The above training procedures are summarized in Algorithm 1.
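The training step above can be sketched as follows. This is an illustration under stated assumptions, not the authors' Algorithm 1: the loss is assumed to be the sum of per-head negative log-likelihoods, as in MIMO-style training, and the head outputs, labels, and function names are hypothetical.

```python
import math
import random


def sample_distinct_triplet(num_videos, rng):
    """Pick three different video indices (i, j, k) that supply the I-frames,
    motion vectors, and residuals of one training input."""
    i, j, k = rng.sample(range(num_videos), 3)
    return i, j, k


def mimo_loss(head_probs, labels):
    """Sum of negative log-likelihoods, one term per prediction head.

    head_probs[h] is the class distribution returned by head h and
    labels[h] is the label of the video that fed that head.
    """
    return sum(-math.log(probs[t]) for probs, t in zip(head_probs, labels))


rng = random.Random(0)
i, j, k = sample_distinct_triplet(num_videos=10, rng=rng)

# Hypothetical head outputs over three classes and the labels (t_i, t_j, t_k).
head_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
labels = [0, 1, 2]
loss = mimo_loss(head_probs, labels)
```

Each head is penalized only for the label of its own input, so a head gains nothing from attending to the other two features.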

2) INFERENCE
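During the inference phase, the MussNet model makes predictions from the same video, as depicted in Fig. 3-(b). The final prediction $p_\theta(y \mid x^I_i, x^M_i, x^R_i)$ is computed by averaging the predictions from each compressed video feature as follows:
$$p_\theta(y \mid x^I_i, x^M_i, x^R_i) = \frac{1}{3} \sum_{f \in \{I, M, R\}} p_\theta(y^f = y \mid x^I_i, x^M_i, x^R_i)$$

Let $h_{ijk}$ denote the hidden vector computed from the inputs $\{x^I_i, x^M_j, x^R_k\}$; accordingly, $h_{lim}$ contains the motion vector features of $x_i$ and $h_{noi}$ contains the residual features of $x_i$. To extract the features of $x^M_i$ and $x^R_i$ from $h_{lim}$ and $h_{noi}$, the extract submodule, which is a shallow neural network, receives one of the hidden vectors $h_{lim}$ and $h_{noi}$ and, similar to the prediction heads of the MussNet model, returns two vectors as follows:
$$(\hat{h}^M_i, \hat{h}^R_m) = \mathrm{Extract}(h_{lim}), \qquad (\hat{h}^M_o, \hat{h}^R_i) = \mathrm{Extract}(h_{noi})$$
Here, $\hat{h}^M_i$ and $\hat{h}^R_i$ are expected to contain the motion vector and residual features from $x_i$ after optimization.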
The generated vectors $\hat{h}^M_i$ and $\hat{h}^R_i$ are concatenated and processed by the fuse submodule as follows:
$$\hat{h}_i = \mathrm{Fuse}(\hat{h}^M_i \oplus \hat{h}^R_i) \tag{5}$$
where $\oplus$ denotes an operation that concatenates vectors along their channel dimension. Here, $\hat{h}_i$ has the same shape as $h_{ijk}$¹ and holds the features of motion vectors and residuals of $x_i$.
¹ The subscript indicating the layer at which the hidden vector $h_{ijk}$ is output is omitted for simplicity of notation.
To incorporate the fused features $\hat{h}_i$ into $h_{ijk}$, we can take the element-wise addition of $\hat{h}_i$ and $h_{ijk}$. However, we found that such a simple fusion decreases the accuracy of the MussNet model. The likely reason for the accuracy decrease is that $\hat{h}_i$ disturbs the features of $x_j$ and $x_k$ contained in $h_{ijk}$ and makes the optimization difficult.
To solve this problem, we introduce the scale submodule (Fig. 4-(c)) inspired by squeeze-and-excitation [38]. This submodule generates scaling weights for every channel of $\hat{h}_i$ from $h_{ijk}$. Each scaling weight is normalized into the range of [0, 1] using the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$ and weakens the signals of $\hat{h}_i$ to maintain the features of $x_j$ and $x_k$ in $h_{ijk}$.
In this study, the scale submodule consists of global average pooling (GAP), followed by multi-layer perceptrons (MLPs). GAP reduces the number of elements of $h_{ijk}$ by averaging it along the height and width dimensions; therefore, the following MLPs can transform $h_{ijk}$ to scaling weights with low computational cost.
By using the scale submodule and the outputs of the fuse submodule $\hat{h}_i$, the hidden vector $h_{ijk}$ is updated as follows:
$$h_{ijk} \leftarrow h_{ijk} + \mathrm{Scale}(h_{ijk}) \odot \hat{h}_i \tag{6}$$
where $\odot$ denotes channel-wise multiplication. Now, $h_{ijk}$ has the intermediately fused features of all the compressed video features from $x_i$, in addition to the features of the motion vector from $x_j$ and the residual from $x_k$. The updated $h_{ijk}$ is processed by the following hidden layers for intermediate fusion. The EFS module also updates $h_{lim}$ and $h_{noi}$ using different sets of videos. We show the training algorithm for the EFS module in Algorithm 2.
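The extract, fuse, and scale steps described above can be sketched on tiny hidden vectors. This is a toy illustration and not the authors' implementation: the hidden vector is stored as channels × spatial positions, every submodule is reduced to a single 1 × 1 convolution or linear layer, and all weights and names (`conv1x1`, `w_fuse`, `w_scale`, and so on) are hypothetical.

```python
import math


def conv1x1(h, weight):
    """1x1 convolution: mix channels independently at every spatial position.
    h is [C_in][P] (channels x positions); weight is [C_out][C_in]."""
    positions = len(h[0])
    return [[sum(w * h[c][p] for c, w in enumerate(row)) for p in range(positions)]
            for row in weight]


def extract(h, w_m, w_r):
    """Return motion-vector and residual feature maps from one hidden vector."""
    return conv1x1(h, w_m), conv1x1(h, w_r)


def scale_weights(h, weight):
    """Global average pooling, one linear layer, then a sigmoid per channel.
    (The text describes MLPs; one layer keeps the sketch short.)"""
    pooled = [sum(ch) / len(ch) for ch in h]
    logits = [sum(w * v for w, v in zip(row, pooled)) for row in weight]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def efs_update(h_ijk, h_m, h_r, w_fuse, w_scale):
    """Add the fused features to h_ijk, weighted per channel by the scale output."""
    fused = conv1x1(h_m + h_r, w_fuse)  # list '+' concatenates along channels
    s = scale_weights(h_ijk, w_scale)
    return [[x + s[c] * f for x, f in zip(h_ijk[c], fused[c])]
            for c in range(len(h_ijk))]


# Toy hidden vectors with 2 channels and 2 positions.
h_ijk = [[1, 2], [3, 4]]
h_lim = [[1, 1], [0, 0]]   # carries motion-vector features of video i
h_noi = [[0, 0], [2, 2]]   # carries residual features of video i
w_m, w_r = [[1, 0]], [[0, 1]]
h_m_i, _ = extract(h_lim, w_m, w_r)
_, h_r_i = extract(h_noi, w_m, w_r)
updated = efs_update(h_ijk, h_m_i, h_r_i,
                     w_fuse=[[1, 0], [0, 1]], w_scale=[[0, 0], [0, 0]])
```

With zero scale weights the sigmoid outputs 0.5 for every channel, so exactly half of the fused signal is added; a learned scale submodule would instead choose the weight per channel from the content of the hidden vector.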
Inference: The EFS module does not need multiple input videos for inference, similar to the MussNet model. When $\{x^I_i, x^M_i, x^R_i\}$ are given as inputs of the MussNet model, the hidden vector can be represented as $h_{iii}$, and the extract submodule returns the following vectors:
$$(\hat{h}^M_i, \hat{h}^R_i) = \mathrm{Extract}(h_{iii})$$
where both vectors $\hat{h}^M_i$ and $\hat{h}^R_i$ are available as inputs of the fuse submodule. We show the procedure of the EFS module during inference in Algorithm 3.

D. INFORMATION ROUTING
To update other hidden vectors in addition to $h_{ijk}$, we need additional videos sampled from the dataset $D$. For example, to update $h_{lim}$, we need features of motion vectors and residuals on $x_l$. However, the hidden vectors from the additional videos also need extra videos; therefore, the straightforward implementation of the EFS module optimization requires many input videos to update the MussNet model and the EFS module once. The MussNet model also needs additional computations to process such videos, leading to inefficient optimization.
In this study, we propose the information routing technique to overcome this problem (Fig. 5). This technique is available under mini-batch learning with an acceptable batch size (e.g., 32 or 64). The extra motion vector and residual features needed to optimize the EFS module are collected from the mini-batches; therefore, we do not need extra videos other than the mini-batches to optimize the MussNet model and the EFS module.
Let $B = \{x_1, x_2, \ldots, x_K\}$ be a mini-batch of videos with batch size $K$. We use mini-batches of I-frames, motion vectors, and residuals for compressed video action recognition, where $B^I$, $B^M$, and $B^R$ denote such mini-batches as follows:
$$B^I = \{x^I_1, x^I_2, \ldots, x^I_K\}, \quad B^M = \{x^M_1, x^M_2, \ldots, x^M_K\}, \quad B^R = \{x^R_1, x^R_2, \ldots, x^R_K\}$$
In our optimization, we require sets of compressed video features sampled from different videos. To create such sets from $B^I$, $B^M$, and $B^R$, we shuffle elements of $B^M$ and $B^R$ with different orders $\{\phi^M_1, \phi^M_2, \ldots, \phi^M_K\}$ and $\{\phi^R_1, \phi^R_2, \ldots, \phi^R_K\}$ as follows:
$$B^M \leftarrow \{x^M_{\phi^M_1}, x^M_{\phi^M_2}, \ldots, x^M_{\phi^M_K}\}, \quad B^R \leftarrow \{x^R_{\phi^R_1}, x^R_{\phi^R_2}, \ldots, x^R_{\phi^R_K}\}$$
The MussNet model concatenates $B^I$, $B^M$, and $B^R$ along the channel dimension and transforms them into the mini-batch of hidden vectors $h$ as the inputs of the EFS module. While the EFS module also requires other hidden vectors, which contain features of motion vectors and residuals of $\{x_1, x_2, \ldots, x_K\}$, these vectors are already computed in $h$. To access the features of motion vectors and residuals, we feed $h$ into the extract submodule and generate mini-batches $\hat{h}^M = \{\hat{h}^M_{\phi^M_1}, \ldots, \hat{h}^M_{\phi^M_K}\}$ and $\hat{h}^R = \{\hat{h}^R_{\phi^R_1}, \ldots, \hat{h}^R_{\phi^R_K}\}$. Then, elements of $\hat{h}^M$ and $\hat{h}^R$ are permuted as follows:
$$h^M = \{\hat{h}^M_1, \hat{h}^M_2, \ldots, \hat{h}^M_K\}, \quad h^R = \{\hat{h}^R_1, \hat{h}^R_2, \ldots, \hat{h}^R_K\}$$
Here, the $i$-th elements of $h^M$ and $h^R$ contain the motion vector and residual features of $x_i$. By replacing $h_{ijk}$ with $h$, $\hat{h}^M_i$ with $h^M$, and $\hat{h}^R_i$ with $h^R$ in Eq. 5 and Eq. 6, we can complete the procedures of the EFS module. The general algorithm of the information routing is summarized in Algorithm 4.

Algorithm 3. Procedure of the EFS module for inference. Require: hidden vector $h_{iii}$.
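The routing idea can be illustrated with plain lists: features are shuffled before entering the network and restored to video order afterwards by the inverse permutation. The sketch below is hypothetical in its names and uses strings as stand-ins for feature tensors; note that it draws plain random shuffles and does not enforce that every slot combines three different videos.

```python
import random


def make_orders(batch_size, rng):
    """Draw two independent shuffles, one for motion vectors and one for residuals."""
    phi_m = list(range(batch_size))
    phi_r = list(range(batch_size))
    rng.shuffle(phi_m)
    rng.shuffle(phi_r)
    return phi_m, phi_r


def shuffle(batch, order):
    """Slot n of the shuffled batch holds element order[n] of the original batch."""
    return [batch[o] for o in order]


def unshuffle(shuffled, order):
    """Inverse permutation: move the element in slot n back to position order[n]."""
    restored = [None] * len(order)
    for slot, original_index in enumerate(order):
        restored[original_index] = shuffled[slot]
    return restored


rng = random.Random(0)
K = 4
phi_m, phi_r = make_orders(K, rng)

batch_m = [f"mv_{n}" for n in range(K)]    # motion vectors of x_0 .. x_3
batch_r = [f"res_{n}" for n in range(K)]   # residuals of x_0 .. x_3

# Slot n of the network input pairs I-frame n with these shuffled features.
shuffled_m = shuffle(batch_m, phi_m)
shuffled_r = shuffle(batch_r, phi_r)

# After the extract submodule, restore video order so slot i matches x_i again.
routed_m = unshuffle(shuffled_m, phi_m)
routed_r = unshuffle(shuffled_r, phi_r)
```

Because the needed features already exist somewhere in the mini-batch, routing only reorders them and requires no additional forward passes.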

IV. EXPERIMENTS
A. DATASETS
We evaluated the proposed method on two widely used public datasets for action recognition: UCF-101 [39] and HMDB-51 [40].
The UCF-101 dataset consists of 13,320 video clips across 101 action categories. All clips are collected from YouTube and have a fixed 25 FPS with a resolution of 320 × 240. This dataset provides train-test splits for three-fold cross-validation, where each split uses about 9,600 videos for training and the others for validation. Note that each split specifies a slightly different number of videos for training and validation, but the number of clips per action category is almost balanced.
The HMDB-51 dataset consists of 6,770 video clips across 51 action categories, where each action category has a minimum of 101 clips collected from various sources, including movies and public databases. Although video clips have various FPS and resolutions, we converted them to 25 FPS and a resolution of 320 × 240. This dataset also provides train-test splits for three-fold cross-validation. Each split specifies 70 clips for training and 30 clips for validation per category. Therefore, the number of clips per action category in train-test splits is perfectly balanced with a ratio of 70:30.

TABLE 1. Partial network architecture of the extract submodule corresponding to one of the compressed video features. The extract submodule has two networks to extract motion vector and residual representations.

Algorithm excerpt (final steps): ĥ ← Fuse(ĥM, ĥR); h ← hidden_layer(h); end for; y ← output_layer(h).

B. IMPLEMENTATION DETAILS
We employed ResNet18, ResNet34, and ResNet50 [31] with temporal shift modules (TSM) [8] as backbone networks. These networks consist of 2D convolution layers, and TSMs adapt them for video processing without increasing the number of parameters or computational complexity by shifting parts of hidden vector channels to the future and past frames. In our experiments, TSMs are placed before every ResNet block. The EFS modules consist of 1 × 1 convolution layers and are placed after the first, second, and third ResNet blocks. The fuse submodule is one convolution layer without any activation functions or normalization layers, while the extract and scale submodules are summarized in Table 1 and Table 2.
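The temporal shift operation mentioned above can be sketched on a small frames × channels array. This is a simplified illustration rather than the TSM reference code: the choice of which channel group moves in which direction and the `fold` parameter name are assumptions, and vacated positions are filled with zeros.

```python
def temporal_shift(frames, fold):
    """Shift channel groups along the time axis without any learned parameters.

    frames is [T][C]. The first `fold` channels take their value from the previous
    frame, the next `fold` channels from the next frame, and the rest stay in place.
    Positions with no source frame are zero-filled.
    """
    num_frames, num_channels = len(frames), len(frames[0])
    out = [[0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1   # information flows from the past
            elif c < 2 * fold:
                src = t + 1   # information flows from the future
            else:
                src = t       # unshifted channels
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


clip = [[1, 10, 100],
        [2, 20, 200],
        [3, 30, 300]]
shifted = temporal_shift(clip, fold=1)
```

After the shift, each frame's feature vector mixes values from its neighbors, so the following 2D convolution sees temporal context at no extra parameter cost.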
To conduct our experiments, we resized all videos to a resolution of 320 × 240 and compressed them using the MPEG-4 Part 2 format. We set the size of GOPs to 12, meaning each GOP contains an I-frame and 11 P-frames. During the training phase, we randomly selected five GOPs from each video and chose an I-frame and a P-frame from each GOP. We applied random cropping and horizontal flipping to the selected five I-frames and P-frames to obtain patches with a 224 × 224 resolution. During the evaluation phase, we uniformly sampled five I-frames and P-frames and used center cropping to obtain patches of the same size as in the training phase.
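The frame sampling described above can be sketched with frame indices only. The following code is a hedged illustration: the function names are hypothetical, the video is assumed to contain at least as many complete GOPs as sampled segments, and the exact rule for uniform sampling at evaluation time is an assumption.

```python
import random

GOP_SIZE = 12  # one I-frame followed by 11 P-frames


def sample_training_frames(num_frames, num_segments, rng):
    """Randomly pick GOPs, then take the I-frame and one random P-frame from each."""
    num_gops = num_frames // GOP_SIZE
    gops = sorted(rng.sample(range(num_gops), num_segments))
    picks = []
    for g in gops:
        i_frame = g * GOP_SIZE                            # first frame of the GOP
        p_frame = i_frame + rng.randint(1, GOP_SIZE - 1)  # any of the 11 P-frames
        picks.append((i_frame, p_frame))
    return picks


def sample_eval_gops(num_frames, num_segments):
    """Evenly spaced GOP indices for evaluation."""
    num_gops = num_frames // GOP_SIZE
    return [(s * num_gops) // num_segments for s in range(num_segments)]


rng = random.Random(0)
train_picks = sample_training_frames(num_frames=120, num_segments=5, rng=rng)
eval_gops = sample_eval_gops(num_frames=120, num_segments=5)
```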
The backbone networks were pre-trained on the ImageNet dataset [41] using the MIMO method with three prediction heads, thus creating multiple subnetworks in the backbone networks. After pre-training the backbone networks, the MussNet was constructed by adding randomly initialized TSMs and EFS modules. Then, the networks were optimized using stochastic gradient descent with Nesterov's momentum [42]. The learning rate η was scheduled using the cosine decay schedule [43] with a 10-epoch warmup, which is defined in terms of the total number of epochs T, the current epoch t, and the peak value of the learning rate η peak. We set η peak = 0.1 for non-pretrained parameters and η peak = 0.01 for pretrained parameters. The hyperparameters for optimization are summarized in Table 3.
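One common form of a cosine decay schedule with warmup is sketched below. The exact formula used in the experiments is not reproduced here, so the linear warmup shape and the way the cosine phase is normalized are assumptions; only T, t, the peak value, and the 10-epoch warmup come from the text.

```python
import math


def learning_rate(t, total_epochs, peak, warmup_epochs=10):
    """Linear warmup to `peak`, then cosine decay to zero at `total_epochs`.

    t is the current epoch (0-based). Both the linear warmup and the
    normalization of the cosine phase are assumptions of this sketch.
    """
    if t < warmup_epochs:
        return peak * (t + 1) / warmup_epochs
    progress = (t - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


# Separate peaks for newly added and pretrained parameters, as stated in the text.
schedule_new = [learning_rate(t, total_epochs=100, peak=0.1) for t in range(100)]
schedule_pretrained = [learning_rate(t, total_epochs=100, peak=0.01) for t in range(100)]
```

Using a smaller peak for pretrained parameters keeps the ImageNet-initialized weights from drifting quickly while the randomly initialized TSMs and EFS modules are still far from converged.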

C. BASELINE METHODS
We will show that MussNet can achieve similar accuracy to the late and intermediate fusion methods while keeping the efficiency of a single network. Hence, we selected the naïve early, late, and intermediate fusion methods described in Sec. III-A2 as our baseline methods. For a fair comparison, the same backbone networks as those of MussNet are used for the baseline methods in our experiments. The early fusion baseline is equivalent to the MussNet trained with compressed video features extracted from the same videos during training. The late fusion baseline uses three backbone networks to classify videos from each compressed video feature. The intermediate fusion baseline is based on the late fusion baseline but has mono-directional information paths from the motion vector and residual networks to the I-frame network. The mono-directional information path concatenates the outputs of the arbitrary layers of the motion vector and residual networks, transforms them by a 1 × 1 convolution layer, and adds the transformed representations to the outputs of the I-frame network layers. The mono-directional information path is placed after the first, second, and third ResNet blocks in the same way as the MussNet with EFS modules. While the late and intermediate fusion baselines have three predictions from each network, the final prediction is computed by averaging these predictions.
We used the standard ImageNet pre-trained parameters provided by the PyTorch library [44] to initialize the baseline networks. Then, the baseline networks were fine-tuned using the same optimization strategy as in Sec. IV-B.

D. COMPARISON WITH BASELINES
We trained the MussNet with EFS modules and our baselines and presented the training results in Fig. 6. By comparing our baselines, we found that early fusion performed significantly worse than the other baselines in terms of accuracy. Specifically, the early fusion accuracies were approximately 10 points lower for the UCF-101 dataset and 15-20 points lower for the HMDB-51 dataset than those of the other baselines. In more detail, the accuracy improvement of late fusion of {ResNet18-TSM, ResNet34-TSM, ResNet50-TSM} against early fusion was {12.2, 11.7, 9.7} points for the UCF-101 dataset and {17.9, 16.4, 15.6} points for the HMDB-51 dataset. These findings demonstrate that early fusion is unsuitable for the effective classification of compressed video features with a single network. In addition, the intermediate fusion achieved better accuracy than early and late fusion methods in most cases; incorporating an intermediate fusion into our model is therefore expected to improve accuracy.
Comparing the proposed method to the baselines with the same network architecture, the computational complexity of the MussNet was less than half that of late fusion and intermediate fusion, as the MussNet only utilizes one network for classification. Despite this, the proposed method achieved comparable accuracy to late and intermediate fusion. These findings indicate that the proposed method is better suited for training a single network for compressed videos than early fusion. Furthermore, employing three ResNet18-TSMs requires about as much computation as a single ResNet50-TSM. Comparing the proposed method of ResNet50-TSM with the intermediate fusion baseline of ResNet18-TSM in Fig. 6, we found that the proposed method resulted in better accuracy. This result illustrates that the MussNet with EFS modules also improves on the accuracy of naïve late fusion or intermediate fusion models without increasing computational complexity.

E. ANALYSIS OF THE MUSSNET MODEL AND THE EFS MODULE
FIGURE 6. Comparison between the proposed method and baselines with respect to the accuracy and computational complexity (GFLOPs). We used ResNet18-TSM, ResNet34-TSM, and ResNet50-TSM as backbone networks for the proposed method and baselines (ResNet18-TSM consumes the lowest GFLOPs). We report the average scores of all train-test splits.

TABLE 4. The experimental results of ablation studies using ResNet34-TSM as a backbone network.

We conducted ablation experiments to evaluate how much our proposed methods contribute to the efficiency and accuracy of the MussNet model. This section presents the validation results based on the ResNet34-TSM model using the UCF-101 and HMDB-51 datasets. We chose ResNet34-TSM for the ablation study because it offers a good trade-off between accuracy and computational complexity. All experimental results of the ablation study are summarized in Table 4.
From our experiments, we obtained the following insights into the MussNet model and the EFS module.

1) CREATING SUBNETWORKS IN THE MussNet MODEL IS THE MOST IMPORTANT FOR ACCURACY
The MussNet model is optimized using compressed video features from different videos to create subnetworks in a single network. We first evaluated whether creating subnetworks in the MussNet model contributes to the classification performance. For this evaluation, we trained the MussNet model using compressed video features from the same videos.
As shown in w/o MIMO training (Early fusion) of Table 4, when the MussNet model was optimized using compressed video features from the same videos, it only achieved 77.5% accuracy for UCF-101 and 44.0% accuracy for HMDB-51. These results were 10.1 points and 19.0 points worse than the full method. This performance degradation was the largest in our ablation study; thus, we concluded that creating subnetworks was the most important factor for the accuracy of the MussNet model.

2) EXTENDING THE MussNet MODEL INTO INTERMEDIATE FUSION IMPROVES ACCURACY
We also evaluated whether extending the MussNet model into intermediate fusion improves accuracy. For this evaluation, we trained the MussNet model without the EFS modules and summarized the experimental results in w/o EFS module of Table 4.
As shown in the results, the MussNet model without EFS modules achieved 87.2% accuracy for UCF-101 and 62.0% accuracy for HMDB-51. Comparing these results with the full method, the EFS modules improved accuracy by 0.4 points for UCF-101 and 1.0 points for HMDB-51. This experimental result showed that integrating the EFS module into the MussNet model improved the accuracy of the MussNet model.
However, this experiment is not enough to claim that extending the MussNet model into intermediate fusion 20992 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.improves accuracy because there is a possibility that the additional computations of the EFS module improved accuracy.To remove this possibility, we used the inference computation of the EFS modules for training and disabled intermediate fusion while keeping the computation proce-dures.Specifically, we replaced Eq. 5 with the following equations: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE 6. Comparison between the proposed method and baselines with respect to action recognition performance and computational complexity (GFLOPs). We report the results of split 1 (S1), split 2 (S2), split 3 (S3), and the average (Avg.) scores of all splits. The parameters and GFLOPs are computed on UCF-101.
for this ablation study.

3) THE EFS MODULE ONLY INCREASES 1% GFLOPs OF THE MussNet MODEL
The comparison of the full method and "w/o EFS module" also showed that the EFS module increases the GFLOPs of the MussNet model by only 1%. Specifically, the full method used 29.3 GFLOPs, which was only 0.3 GFLOPs more than "w/o EFS module". This observation showed that the EFS module can be used without reducing the efficiency of a single network-based approach.
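The 1% figure follows directly from the two reported numbers. The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
full_gflops = 29.3          # full method (reported)
efs_overhead_gflops = 0.3   # extra cost of the EFS modules (reported)

# Cost of the model without the EFS modules, and the relative overhead.
base_gflops = full_gflops - efs_overhead_gflops
relative_overhead = efs_overhead_gflops / base_gflops

print(f"w/o EFS: {base_gflops:.1f} GFLOPs, overhead: {relative_overhead:.1%}")
```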

4) THE SCALE SUBMODULE IS ESSENTIAL IN THE EFS MODULE
We evaluated whether the scale submodule contributed to accuracy. For this evaluation, we removed the scale submodule from the EFS module and used only the fuse submodule output $\hat{h}_i$ to update the hidden vector $h_{ijk}$, in place of the update in Eq. 6. We trained the model on UCF-101 and HMDB-51 and summarized the results in the "w/o the Scale submodule" row of Table 4. Removing the scale submodule achieved only 82.9% accuracy for UCF-101 and 49.7% accuracy for HMDB-51, which are 4.7 and 13.3 points lower than the full method. Because this accuracy is worse than that of the MussNet model without the EFS module, the scale submodule is essential for the EFS module to improve the accuracy of the MussNet model.
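The ablation gaps reported so far can be cross-checked with a few lines of arithmetic. In this sketch the full-method accuracies are an assumption: they are not restated in this section and are recovered here by adding the reported gaps to the scale-submodule ablation (82.9 + 4.7 and 49.7 + 13.3).

```python
# Reported accuracies (%) of the ablated variants: (UCF-101, HMDB-51).
ablations = {
    "w/o MIMO training (early fusion)": (77.5, 44.0),
    "w/o EFS module": (87.2, 62.0),
    "w/o scale submodule": (82.9, 49.7),
}

# Assumption: full-method accuracy inferred from the scale-submodule gaps.
full = (round(82.9 + 4.7, 1), round(49.7 + 13.3, 1))

# Accuracy drop of each variant relative to the full method, in points.
drops = {
    name: tuple(round(f - a, 1) for f, a in zip(full, acc))
    for name, acc in ablations.items()
}

for name, (ucf, hmdb) in drops.items():
    print(f"{name}: -{ucf} (UCF-101), -{hmdb} (HMDB-51)")
```

Under this assumption, the gap for "w/o EFS module" comes out at 0.4 points on UCF-101 and 1.0 points on HMDB-51, which agrees with the gains stated in the abstract.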

5) THE LATER EFS MODULES LEARN BETTER DISENTANGLEMENT OF FEATURES
We also analyzed whether the outputs of the extract submodules only hold features from either motion vectors or residuals.
For this analysis, we fixed our model's parameters and trained new classifiers that predict the labels corresponding to I-frames, motion vectors, or residuals from the outputs of the extract submodules, $h^M_*$ and $h^R_*$; i.e., we trained six classifiers per extract submodule. Each classifier consisted of a global average pooling layer followed by three fully connected layers with 1,024 nodes and ReLU activations. We applied an optimization strategy similar to that of Sec. IV-B but fixed the total number of epochs at 150 for both the UCF-101 and HMDB-51 datasets. If the extract submodules completely disentangled motion vector and residual features (i.e., made the other features inaccessible), only the labels of the corresponding features would be predictable, while the others would not. The experimental results are summarized in Fig. 7.
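The probing setup can be enumerated explicitly to make the count of six classifiers and the expected outcome concrete. The names `h_M`, `h_R`, and `expected_predictable` are illustrative only, and the sketch covers the bookkeeping of the analysis, not the classifiers themselves.

```python
from itertools import product

extract_outputs = ("h_M", "h_R")  # the two outputs of one extract submodule
label_sources = ("I-frame", "motion vector", "residual")

# One probe classifier per (output, label source) pair.
probes = list(product(extract_outputs, label_sources))

# The feature each output is meant to retain after disentanglement.
intended = {"h_M": "motion vector", "h_R": "residual"}

def expected_predictable(output, label_source):
    """True if the labels should remain predictable under ideal disentanglement."""
    return intended[output] == label_source

for output, source in probes:
    print(output, source, expected_predictable(output, source))
```

Under ideal disentanglement only two of the six probes should succeed, one per output, which is the pattern to look for in Fig. 7.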
From our experiments, we found that the 2nd and 3rd extract submodules disentangled the features of motion vectors and residuals from the hidden vectors. However, the disentanglement of the 2nd submodule was not as clear as that of the 3rd submodule. In addition, the 1st submodule failed to disentangle the features, because the accuracy for motion vectors was consistently better than that for residuals.
These results indicate that our proposed method learns to extract the corresponding features from hidden vectors for intermediate fusion. The 1st submodule failed to disentangle the features because the transformation of the 1st ResNet block was not sufficient, and the intermediate fusion at the 1st EFS module contributes less to the classification than that at the 2nd and 3rd EFS modules. When the extract submodules fail to disentangle features, the EFS modules fuse noisy features into the subnetworks and make the optimization of the MussNet model challenging. Therefore, improving the extract submodules seems to be a promising direction for effective intermediate fusion in a single network-based approach.

F. COMPARISON WITH PREVIOUS STUDIES
Finally, we compared our method with conventional compressed video action recognition methods and summarized the results in Table 5. This result showed that the MussNet model with the EFS modules was one of the most efficient methods in terms of computational complexity, even when we used ResNet50-TSM as our backbone. Only MTFD was more efficient than the MussNet model with ResNet34-TSM or ResNet50-TSM. However, with the same ResNet18 backbone as MTFD, the MussNet model with ResNet18-TSM was more efficient, because MTFD still used three ResNet18 networks to classify compressed video features, whereas the MussNet model used only one.
The accuracy comparison shows that our model was competitive against most previous methods. This result indicates that a single network can achieve the same level of accuracy as multiple networks. However, recent models such as SIFP, TEMSN, and MTFD were more accurate than the MussNet model while remaining computationally efficient. One reason for the accuracy gap between the MussNet model and these previous methods is that they introduce various techniques, such as knowledge distillation, while the MussNet model is only optimized using the cross-entropy loss.
From this comparison, we consider a single network-based method to be a promising direction for efficient compressed video action recognition. The advantage of the single network-based method over previous methods is its efficiency even when relatively large networks (e.g., ResNet50-TSM) are used as the backbone network. However, the MussNet model does not reach the accuracy of state-of-the-art efficient compressed video action recognition methods.

V. DISCUSSION
While the MussNet was one of the most efficient compressed video action recognition models in terms of computational complexity, our comparison of the MussNet with previous studies revealed that its accuracy does not match that of state-of-the-art methods even when EFS modules are introduced. We believe that the reason for the inferior performance is that we only used the standard cross-entropy loss for training, whereas state-of-the-art methods employ cross-entropy loss as well as other modules or loss terms to improve their classification performance. For instance, DMC-Net and SIFP employed shallow modules that estimated optical flow from motion vectors and residuals and used the estimated optical flow as an additional input. Wu et al. used knowledge distillation from powerful yet heavy networks to optimize lightweight networks. Given that our model only uses subnetworks instead of multiple networks, introducing most of the modules or loss terms used in previous studies is feasible. However, the MussNet models are required to create subnetworks within the backbone network, and directly introducing these modules and loss terms may make the training unstable. Therefore, introducing additional modules and loss terms to improve the accuracy of the MussNet model remains future work toward efficient yet effective compressed video action recognition methods.
The limitation of the proposed method is that it requires the backbone networks to have sufficient capacity to hold multiple subnetworks. While we used the ResNet family, which consists of standard convolution layers, as our backbone network, some modern networks use sparser layers, such as depthwise separable convolution [33] and MBConv [45], [46], to achieve more accurate yet efficient predictions. These sparse layers have fewer parameters than standard convolution layers and can therefore have reduced capacity. Such modern networks can limit the power of the MussNet; hence, it is necessary to improve the MussNet so that it can make use of their limited capacity. The recently proposed techniques to improve MIMO-based ensembles [47], [48], [49] may help resolve this problem, but this remains future work.

VI. CONCLUSION
In this study, we introduced the MussNet and EFS modules for efficient compressed video action recognition. The proposed method requires only one network to process compressed video features, reducing overall computational complexity while maintaining the classification accuracy of naïve late and intermediate fusion baselines.
Our experiments demonstrated that the MussNet achieved classification performance comparable to previous compressed video action recognition methods but with significantly lower computational complexity. However, our model was only optimized by cross-entropy loss, while some previous studies used more effective loss terms and modules. Furthermore, the MussNet is a MIMO-based method, so techniques that improve MIMO methods should also benefit the MussNet. In future work, we will integrate such additional loss terms, modules, and techniques into our method to improve accuracy.

APPENDIX A PRACTICAL IMPLEMENTATION OF MussNet
In Listing 1, we show a PyTorch-like implementation of the MussNet model and the EFS module.

APPENDIX B TABULAR RESULTS OF COMPARISON
We show the tabular results of comparison with baselines in Table 6.

FIGURE 1. Illustration of compressed video action recognition and traditional RGB frame-based action recognition. Compressed video action recognition is more efficient than the RGB frame-based approach because decoding is not required to obtain I-frames, motion vectors, and residuals from compressed video files.

FIGURE 2. Naïve fusion methods for compressed video features.

Algorithm 1 Train the MussNet Model
Require: Network parameters θ, training dataset D, learning rate η
Ensure: Updated parameters θ
1: while θ does not converge do
2:

FIGURE 5. Information routing for the efficient optimization of the MussNet model with the EFS modules. The colors of the arrows indicate which compressed video features each path focuses on. We omit the EFS module procedures for $\{x_4, x_5, \ldots, x_{12}\}$ for clarity of visualization, but all inputs are processed in practice.

FIGURE 7. Accuracy of action recognition from the outputs of each extract submodule. The bar colors show the types of compressed video features whose labels are estimated by the classifiers.

TABLE 5.

LISTING 1. PyTorch-like implementation of the MussNet model and the EFS module.

C. EXTRACT, FUSE, AND SCALE MODULE
To extend the MussNet model into intermediate fusion, we develop the extract, fuse, and scale (EFS) module depicted in Fig. 4. The EFS module consists of shallow networks and can be placed after any hidden layer. For intermediate fusion, the EFS module aggregates the features of I-frames, motion vectors, and residuals that come from the same videos. However, to construct subnetworks during training, the MussNet with the EFS module has to simultaneously process compressed video features from different videos, in the same way as without the EFS module. To satisfy these two demands of intermediate fusion and independent processing, we adopt a strategy of aggregating, in the EFS module, the intermediate features from multiple feedforwarding steps that exclusively contain the compressed features of the same videos. Let $h_{ijk}$ be one of the hidden layer outputs in the network generated from $x^I_i$, $x^M_j$, and $x^R_k$. Focusing on one of the input videos, $x_i$, the hidden vector $h_{ijk}$ holds only the I-frame features of $x_i$; the features of the motion vectors and residuals of $x_i$ are missing. Therefore, we must take the missing features from other hidden vectors for intermediate fusion.
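The lookup implied by the last two sentences can be sketched with plain indices. This is only an illustration of the bookkeeping, not the EFS module itself: the helper `find_sources` and the cyclic-shift batch are hypothetical, and each entry of `triples` plays the role of the index $(i, j, k)$ of one hidden vector $h_{ijk}$.

```python
def find_sources(triples, video):
    """Locate the feedforward steps that carry each feature of one video.

    triples[n] = (i, j, k) lists the videos that supplied the I-frame,
    motion vector, and residual of step n.
    """
    sources = {}
    for step, (i, j, k) in enumerate(triples):
        if i == video:
            sources["I-frame"] = step
        if j == video:
            sources["motion vector"] = step
        if k == video:
            sources["residual"] = step
    return sources

# Hypothetical batch of three steps over videos 0, 1, 2 (cyclic shifts).
triples = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
print(find_sources(triples, video=0))
```

For video 0 in this toy batch, the I-frame features sit in step 0, the residual features in step 1, and the motion vector features in step 2, so fusing all three requires reading from three different hidden vectors.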

TABLE 2. Network architecture of the scale submodule.

TABLE 3. Hyperparameters for the MussNet model optimization.
The experimental results are summarized in the "w/o fusing the extract submodule's outputs of different inputs" row of Table 4. Disabling intermediate fusion achieved 81.7% accuracy for UCF-101 and 50.6% accuracy for HMDB-51. Comparing these results with those of "w/o EFS module", we found that merely introducing the computations while keeping late fusion worsened accuracy rather than improving it. This means that, as we claimed, extending the MussNet model into intermediate fusion is helpful in improving accuracy.