Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Recent video classification research has focused on two areas: temporal modeling and efficient 3D architectures. However, existing temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge this gap, we propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. The T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into spatial and temporal depthwise convolutions to make our network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two types of VoV3D networks, VoV3D-M and VoV3D-L. Thanks to the efficiency and effectiveness of its temporal modeling, VoV3D-L surpasses a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400 with 6× fewer model parameters and 16× less computation. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.


Introduction
Recently, many works [18,12,26,35,2,30] have studied visual tempo, or temporal modeling, for video classification. Unlike 2D image classification, video classification must distinguish visual tempo variations as well as semantic appearance. In other words, appearance information alone is not sufficient to distinguish moving something up vs. down or walking vs. running, which requires capturing visual tempo variations. Thus, effectively modeling visual tempo is a key factor for video classification.
Previous works [30,12,18,17] for temporal modeling select a 2D CNN architecture rather than a 3D CNN one due to its efficiency; they usually process per-frame inputs and aggregate the results to produce a final output by adopting a temporal shift module [18] or a motion information embedding module [12,26]. However, these methods depend heavily on the 2D ResNet [8] backbone, which is neither lightweight nor efficient compared to state-of-the-art efficient 2D CNN models [26,9,27]. 3D CNN based temporal modeling methods [2,35] have also been proposed, constructing an input frame-level pyramid [4] with different input frame rates or a feature-level pyramid [35] for modeling dynamic visual tempos. However, these methods require extra model capacity for a separate network path or a fusion module. In short, since previous works are add-on-style modules on top of a backbone network, they are constrained by that backbone.
Another line of research that has recently attracted attention for video understanding is building efficient network architectures [14,28,3]. These works exploit 3D depthwise convolution to reduce model parameters and computational cost, just as efficient 2D CNN architectures [10,23,9,26,37,20,27] replace a standard convolution with a depthwise convolution and a pointwise convolution. This depthwise separable convolution is a form of channel factorization. However, these efficient 3D networks focus only on building architectures and do not consider temporal modeling.
To address these issues, in this work we propose an efficient and effective temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. The T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSAs enables the network itself to model short-range as well as long-range temporal relationships across frames without any external module.
Inspired by kernel factorization [29] and channel factorization [28,3], we also design a depthwise spatiotemporal factorized component, named D(2+1)D, that decomposes a 3D depthwise convolution into a spatial depthwise convolution and a temporal depthwise convolution to make our network more lightweight and efficient. The efficient D(2+1)D makes it possible to use more input frames (over 16), which is advantageous for temporal modeling. Using the proposed temporal modeling method, T-OSA, and the efficient factorized module, D(2+1)D, we construct two types of VoV3D architecture, the VoV3D-M and VoV3D-L models.
Thanks to the efficiency and effectiveness of the proposed temporal modeling mechanism, VoV3D-L outperforms both the state-of-the-art 2D [17] and 3D [4] temporal modeling methods, while having 6× fewer parameters and 16× or 6× less computation on Something-Something [6] and Kinetics-400 [13], respectively. Furthermore, the proposed VoV3D shows better temporal modeling ability than X3D [3], a state-of-the-art efficient 3D architecture with comparable model capacity.
The main contributions of this work are summarized as follows:
• We propose an efficient and effective 3D temporal modeling architecture, VoV3D, that outperforms the state of the art in terms of accuracy and model capacity.
• We introduce an effective temporal modeling method, Temporal One-Shot Aggregation (T-OSA), that can handle various visual tempo variations by aggregating features with different temporal receptive fields.
• We introduce an efficient depthwise factorized module, D(2+1)D, that decomposes a 3D convolution into spatial and temporal depthwise convolutions, reducing computational cost and improving accuracy.

Related Works

Temporal modeling for video classification
Recent attempts at temporal modeling for video classification can be divided into two categories: 2D CNN based and 3D CNN based methods. 2D CNN based methods such as TSN [30], TSM [18], STM [12], and TEA [17] prefer a 2D CNN, e.g., ResNet-50, as the backbone because it is more efficient than 3D CNNs. They process per-frame inputs and aggregate the results to produce a final output on top of 2D ResNet. TSN [30] proposes to form a clip by sampling evenly from divided segments, and this sparse sampling method has become a common strategy for many works. TSM [18] models temporal motion by utilizing a memory shift operation along the temporal dimension. Since motion information is also an important cue for temporal modeling as a short-term temporal relationship, STM [12] and TEA [17] propose to model feature-level motion: they differentiate between adjacent features to represent motion and add the spatiotemporal features and motion encoding together. TEA [17] also has a temporal aggregation module to capture long-range temporal dependency, similar to T-OSA. However, TEA is based on 2D CNN features that are not jointly convolved along the spatial and temporal axes, so the interaction between spatial and temporal features is more limited than in 3D spatiotemporal methods.
There are also works using spatiotemporal 3D CNNs to model various visual tempos by building an input frame-level pyramid [4,36] or a feature-level pyramid [35]. SlowFast [4] has two network pathways with different input frame rates to capture different types of information, e.g., semantic appearance or motion. DTPN [36] also uses different sampling rates for arbitrary-length input videos, building up an input frame-level hierarchy. Unlike these methods, TPN [35] leverages a feature hierarchy on top of the backbone network instead of the input frame level by building a temporal feature pyramid network. In short, since these temporal modeling methods are based on existing backbone networks, e.g., ResNet-50, they are constrained by the nature of the backbone network.

Efficient 3D CNN architecture
Since channel-wise separable convolution is densely exploited by efficient 2D CNN architectures [10,23,9,20,27,26], 3D CNN architectures [28,3,14] based on the extended depthwise separable convolution have been explored. CSN [28] adopts 3D depthwise separable convolution in the residual bottleneck block [8] by replacing the 3 × 3 × 3 convolution and adding a 1 × 1 × 1 convolution in front of the 3D depthwise convolution for interaction between channels. X3D [3] explores the 3D CNN architecture along the spatial, temporal, depth, and channel axes to maximize the efficacy of a 3D CNN. The depthwise bottleneck is also a key component in X3D, and X3D is progressively expanded from a lightweight to a large-scale model by scaling up all of these axes. As a result, X3D achieves state-of-the-art performance with much smaller model capacity on various video classification datasets such as Kinetics-400. However, this method focuses on building an efficient network, and temporal modeling is not considered deeply. Therefore, we focus on building an efficient 3D CNN architecture for temporal modeling.

Figure 1: (a) illustrates how the temporal receptive field grows; for example, if four temporal convolutions with kernel size 3 are stacked successively, the temporal receptive fields are {3, 5, 7, 9}. In (b), F_{3×3×3} and F_{1×1×1} denote 3D convolutions with 3 × 3 × 3 and 1 × 1 × 1 kernels, respectively. For modeling various visual tempos, the T-OSA aggregates the spatiotemporal features with different temporal and spatial receptive fields at once.

VoV3D
Temporal modeling plays an important role in action recognition. In particular, for videos that lack semantic variation in their features, video classification networks must rely heavily on visual tempo. Moreover, it is necessary to model long-term as well as short-term temporal relationships, because short-term information is not sufficient to distinguish visual tempo variations such as walking vs. running. Conventional temporal modeling methods based on 3D CNNs [4,36,35] try to model visual tempo through input frame-level or feature-level pyramids. However, these methods are external (i.e., plug-in) modules that must add separate networks on top of an existing 3D backbone (e.g., I3D [31]), which requires more parameters and computation. To address these challenges, our aim in this paper is to propose a lightweight and efficient video backbone network that has temporal modeling ability by itself, without external modules. For this purpose, we design a new 3D CNN architecture based on VoVNet [15,16], which expresses hierarchical and diverse feature representations at a small cost. First, we briefly revisit VoVNet [15,16], the inspiration for this work. Then, we introduce an effective temporal modeling method, named Temporal One-Shot Aggregation (T-OSA), based on the OSA module in VoVNet. To make the network lightweight and efficient, we also introduce a depthwise spatiotemporal factorization component, D(2+1)D. Lastly, we design a new video classification network, called VoV3D, which is composed of the proposed T-OSA and D(2+1)D.

Revisiting VoVNet
VoVNet [15,16] is a computation- and energy-efficient 2D CNN architecture devised to learn diverse feature representations by stacking One-Shot Aggregation (OSA) modules. As shown in Fig. 1(b), the OSA module consists of successive 3 × 3 convolutions and aggregates their feature maps at once into one feature map by concatenation, followed by a 1 × 1 convolution. The OSA allows the network to represent diverse features by capturing multiple receptive fields in one feature map, which produces a feature-pyramid effect. Due to the diverse feature representation power of OSA, VoVNet outperforms ResNet [8] and HRNet [25] on object detection and segmentation tasks that require more complex representations.

Temporal One-Shot Aggregation (T-OSA)
Considering the hierarchical spatial features of OSA in VoVNet, it is natural to extend VoVNet to a 3D CNN architecture to model various visual tempos. Thus, as illustrated in Fig. 1, we expand OSA into Temporal-OSA, namely T-OSA, which captures diverse temporal receptive fields in one 4D feature map along the temporal axis. In T-OSA, the i-th 3D convolution F_i^{t×3×3} produces the feature map X_i = F_i^{t×3×3}(X_{i−1}), where t is the temporal kernel size (we set t to 3), i ∈ {1, 2, ..., n}, and n is the number of t × 3 × 3 3D convolutions in T-OSA. Note that we keep the temporal dimension T (frames) for feature aggregation. Each feature map X_i ∈ R^{C×T×W×H} has a progressively increasing temporal receptive field due to the successive connection. For example, if the temporal receptive field (TRF) of X_1 is 3 and the temporal kernel size t is 3, the TRF of the next map X_2 is 5. Thus, once the features are concatenated along the channel axis, the aggregated feature map X_agg ∈ R^{(n+1)C×T×W×H}, comprising {X_in, X_1, ..., X_n}, has diverse temporal and spatial receptive fields in one feature map, where X_in ∈ R^{C×T×W×H} is the input feature and n is set to 4 in Fig. 1(b). Then a 1 × 1 × 1 convolution reduces the channel size from (n + 1)C to C, and a residual connection is added to the final feature map. Therefore, stacking T-OSA makes it possible to model short-range as well as long-range temporal dependencies across frames, which has an effect analogous to a feature pyramid in the same spatial feature space.
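To make the aggregation concrete, the following pure-Python sketch (our own illustration, not the authors' code; the function name `t_osa_shapes` and its bookkeeping are hypothetical) tracks only the channel counts and temporal receptive fields through one T-OSA module, without applying any real convolutions:

```python
def t_osa_shapes(in_channels, n=4, t=3, in_trf=1):
    """Track channels and temporal receptive fields (TRFs) through one T-OSA.

    Each of the n successive t x 3 x 3 convolutions widens the TRF by
    (t - 1); the input plus all n intermediate maps are concatenated along
    the channel axis, then a 1x1x1 conv reduces the channels back to C.
    Returns (concatenated channels, reduced channels, list of TRFs).
    """
    features = [(in_channels, in_trf)]      # the input feature X_in
    trf = in_trf
    for _ in range(n):                      # successive t x 3 x 3 convs
        trf += t - 1                        # TRF grows by (t - 1) per conv
        features.append((in_channels, trf))
    concat_channels = sum(c for c, _ in features)   # one-shot aggregation
    trfs = [r for _, r in features]
    # the 1x1x1 conv maps (n + 1) * C channels back to C; TRFs are unchanged
    return concat_channels, in_channels, trfs
```

With C = 48 and n = 4, the aggregated map has (n + 1)C = 240 channels carrying TRFs {1, 3, 5, 7, 9} before the 1 × 1 × 1 reduction, which is the feature-pyramid behavior described above.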

Depthwise Spatiotemporal Factorization
There are two concepts for factorizing a 3D convolution (3DConv): 1) depth- (or channel-) wise [28,3,14] and 2) kernel-wise [29,21,34] methods. Inspired by efficient 2D image classification networks [10,23,9,37,20,26,27], depthwise separable convolution is also mainly used as a key building block for efficient video backbone networks [28,3,14]. 3D depthwise separable convolution (3DWConv) factorizes a 3D convolution into a t × k × k depthwise convolution followed by a 1 × 1 × 1 pointwise convolution. CSN [28] adds a 1 × 1 × 1 convolution in front of the 3DWConv to preserve the interaction between channels, which improves accuracy. Tran et al. [28] found that the 3DWConv has two advantages: 1) a significant reduction of parameters and computational cost (FLOPs) without sacrificing accuracy and 2) a regularization effect. In addition to channel factorization, kernel factorization has also been widely used [29,21,34] to curtail computation and boost accuracy. Kernel factorization is also called spatiotemporal factorization, as the kernel is decomposed into a 1 × k × k spatial convolution (space) followed by a t × 1 × 1 temporal convolution (time). Our motivation lies in fusing these two factorization methods to realize an efficient video classification network. We design a depthwise spatiotemporal factorized module, D(2+1)D, that decomposes a 3DWConv into a spatial DWConv and a temporal DWConv, as shown in Fig. 2. We analyze the resource requirements of each variant in Table 1, which lists the number of parameters and the computation (FLOPs) of a 3D convolution in the middle of the bottleneck architecture in Fig. 2.
The input tensor of the 3D convolution has shape T × C × H × W, where T, C, H, and W are the number of frames, channels, height, and width, respectively. Assuming the number of filters (output channels) is the same (C), the 3D filter has kernel size t × k × k, where t and k denote the temporal and spatial kernel sizes. As demonstrated in Table 1, compared to the standard 3DConv in (a), the 3DWConv in (c) is C× more efficient because each filter operates on only a single input channel. We design two types of factorized module based on the order of the spatial and temporal dimensions: D(1+2)D and D(2+1)D. Note that spatial down-sampling is performed in the spatial convolution, while the temporal convolution keeps the temporal dimension. Compared with 3DWConv, both D(1+2)D and D(2+1)D have about one order of magnitude fewer parameters and less computation. Between the two factorized modules, an important difference arises in spatial down-sampling.
The number of parameters is the same, while the computational cost differs due to the different spatial sizes. Specifically, for D(1+2)D, the temporal DWConv operates first on the T × C × H × W input tensor, followed by the spatial DWConv with stride s. Its computation is summarized as:

FLOPs_{D(1+2)D} = C · t · T · H · W + C · k² · T · (H/s) · (W/s).

For D(2+1)D, since the spatial DWConv with down-sampling comes first, the temporal DWConv operates on the spatially down-sized tensor, which reduces the overall computation. This is summarized as:

FLOPs_{D(2+1)D} = C · k² · T · (H/s) · (W/s) + C · t · T · (H/s) · (W/s).

Meanwhile, Tran et al. use a non-linearity (i.e., ReLU) between the spatial and temporal convolutions in R(2+1)D [29], whereas we do not use any non-linearity because it degrades performance. When we validate these models, D(2+1)D shows the best performance compared to its counterparts.
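As a sanity check on the analysis above, here is a small Python sketch (our own, with hypothetical function names, not the paper's code) that computes the per-layer FLOP counts of a 3D depthwise convolution and its two factorized orderings:

```python
def dw3d_flops(C, T, H, W, t=3, k=3, s=1):
    # t x k x k 3D depthwise conv with spatial stride s
    return C * t * k * k * T * (H // s) * (W // s)

def d12d_flops(C, T, H, W, t=3, k=3, s=1):
    # temporal DW first (at full spatial resolution), then strided spatial DW
    return C * t * T * H * W + C * k * k * T * (H // s) * (W // s)

def d21d_flops(C, T, H, W, t=3, k=3, s=1):
    # strided spatial DW first, then temporal DW on the down-sized tensor
    spatial = C * k * k * T * (H // s) * (W // s)
    temporal = C * t * T * (H // s) * (W // s)
    return spatial + temporal
```

With a spatial stride s = 2, D(2+1)D is the cheapest of the three because its temporal convolution touches only (H/s) × (W/s) positions; with s = 1, the two factorized orders cost exactly the same.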

VoV3D Architecture
Finally, we construct a lightweight and efficient 3D CNN architecture, VoV3D, that can model various visual tempos effectively with the proposed T-OSA and D(2+1)D modules. We design two types of lightweight models, VoV3D-M and VoV3D-L, which have only 3.3M and 5.6M parameters, respectively. VoV3D is composed of the proposed T-OSA blocks, each of which consists of 4 or 5 D(2+1)D modules followed by a 1 × 1 × 1 convolution; that is, the t × 3 × 3 3DConv F^{t×3×3} in Fig. 1 is replaced with the D(2+1)D module. At the stage level (same spatial resolution), VoV3D has multiple T-OSAs, e.g., 5, in series, which leads to diverse temporal feature representations. conv1 is also a (2+1)D-style convolution, where a 1 × 3 × 3 spatial convolution is followed by a 3 × 1 × 1 temporal convolution. Following [3], we also add a channel attention module, an SE [11] block, into the D(2+1)D with a reduction ratio of 1/16. We note that the lightweight and efficient D(2+1)D allows VoV3D to reduce computation significantly, so it can use longer clips (>16 frames) at 224 × 224 resolution, which enables capturing longer visual tempo. The details are illustrated in Table 2.

Datasets
We validate the proposed VoV3D on Something-Something (V1 & V2) [6] and Kinetics-400 [13]. Unlike Kinetics-400 [13], which is less sensitive to visual tempo variations, Something-Something [6] focuses on human-object interaction, which relies more on temporal relationships than on appearance. Since Something-Something is widely used as a benchmark for evaluating the effectiveness of temporal modeling, VoV3D is mainly evaluated on this dataset. Something-Something V1 [6] contains 108k videos with 174 categories, and the second release (V2) increases this to 220k videos. Kinetics-400 [13] includes 400 categories and provides download URLs for over 240k training and 20k validation videos. Because some YouTube links have expired, we collected 234,619 training and 19,761 validation videos.

Implementation Details
Training. Our models are trained from scratch without an ImageNet [22] pretrained model unless specified. For Something-Something [6], we use segment-based input frame sampling [18], which splits each video into N segments and picks one frame from each segment to form a clip (N frames). We note that thanks to the memory-efficient VoV3D, our model can be trained with more input frames, e.g., from 16 to 32. For Kinetics-400 [13], we sample 16 frames with a temporal stride of 5, as in [3]. We randomly crop 224 × 224 pixels from a clip whose shorter side is randomly sampled in [256, 320] pixels [24,31,4,3], and apply a random horizontal flip, for both VoV3D-M and VoV3D-L. For Something-Something, which requires discriminating between directions, the random flip is not applied. Following [4,3], we adopt a cosine learning rate schedule [19], a linear warm-up strategy [5], and weight decay 5 × 10⁻⁵. For Kinetics-400, we use the same training parameters except for 256 epochs and a mini-batch size of 128. We train all models on an 8-GPU machine, and our implementation is based on PySlowFast [2].
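The segment-based sampling described above can be sketched as follows (a minimal illustration in our own code, not the PySlowFast implementation; `segment_sample` is a hypothetical helper): the video is split into N equal segments, and one frame is drawn from each, randomly during training and at the segment center at test time.

```python
import random

def segment_sample(num_frames_in_video, n_segments, train=True):
    """Return n_segments frame indices, one per equal-length segment."""
    seg_len = num_frames_in_video / n_segments
    indices = []
    for i in range(n_segments):
        start = int(i * seg_len)
        end = max(start, int((i + 1) * seg_len) - 1)
        if train:
            indices.append(random.randint(start, end))  # random frame per segment
        else:
            indices.append((start + end) // 2)          # center frame at test time
    return indices
```

For a 100-frame video and N = 16, this yields 16 monotonically increasing indices covering the whole video, so the clip spans the full temporal extent regardless of video length.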
Following [18,33], we also fine-tune VoV3D from a Kinetics-400 pretrained model. We use a linear warm-up [5] for 2k iterations starting from 0.0001 and a weight decay of 5 × 10⁻⁵. We fine-tune the model for 50 epochs with a base learning rate of 0.05, decayed by 0.1 at epochs 35 and 45. We also use synchronized batch normalization.
To compare VoV3D-M/L to the strong state-of-the-art X3D [3], we also train X3D-M/L models with similar parameters and FLOPs using the same training protocols. Note that for X3D-L, unlike the original X3D paper [3], we use the same spatial sample size, [256, 320], not [356, 446]. The reason is that we invest the computation budget in more input frames (≥16) for the Something-Something dataset, which requires more temporal modeling than spatial semantic information.
Inference. Following common practice [31,18,2,3], we sample multiple clips per video, e.g., 10 for Kinetics and 2 for Something-Something. We scale the shorter spatial side to 256 pixels and take 3 crops of 256 × 256 as an approximation of fully-convolutional testing [31], called full-resolution image testing in TSM [18]. Then we average the softmax scores for the final prediction.
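The score-averaging step can be sketched as follows (our own minimal illustration; `average_scores` is a hypothetical helper, where the views would be, e.g., 10 clips × 3 crops = 30 per video on Kinetics):

```python
def average_scores(view_scores):
    """Average per-class scores over all clip/crop views of one video.

    view_scores: list of per-view class-score lists (already softmaxed).
    Returns the averaged score per class; the argmax is the prediction.
    """
    n_views = len(view_scores)
    n_classes = len(view_scores[0])
    avg = [0.0] * n_classes
    for scores in view_scores:
        for c, s in enumerate(scores):
            avg[c] += s / n_views
    return avg
```

Averaging after the softmax (rather than averaging logits) is the convention the cited works follow; each view contributes equally to the final prediction.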

Main results
Results on Something-Something. We validate the efficiency and effectiveness of the proposed VoV3D on Something-Something V1 & V2 (SSv1/v2), which requires more temporal modeling ability than spatial appearance. Table 3 shows the results and resource budgets of other methods: temporal modeling based on 2D CNNs [18,12,17] and 3D CNN architectures [32,2,35,28,3]. First, with the same 16 frames, VoV3D-M/L consistently outperform X3D-M/L with a comparable model budget on both Something V1 & V2. In particular, the performance gain of the 'L' models is bigger than that of the 'M' models, e.g., 1.7%/0.2% vs. 2.5%/1.4% top-1. This result demonstrates that stacking the proposed T-OSAs improves the modeling of temporal dependency across frames.
VoV3D is also superior to 3D CNN based temporal modeling methods such as SlowFast [4] and CSN [28]. Even without Kinetics pretraining, VoV3D-M-16F achieves higher accuracy than SlowFast pretrained on Kinetics-400 with 11× more model capacity. This demonstrates that a single 3D network path is enough to model various visual tempo variations. Although CSN [28] contains the depthwise bottleneck architecture, its accuracy is lower than VoV3D-M with the same 32 frames. This result shows that the proposed T-OSA plays an important role in temporal modeling.

Results on Kinetics-400.
We compare VoV3D to other state-of-the-art methods on Kinetics-400. VoV3D-L achieves 76.3%/92.9% top-1/top-5 accuracy, outperforming the state-of-the-art 2D temporal modeling method, TEA [17], even without ImageNet pretraining. VoV3D-L also surpasses the 3D temporal modeling method SlowFast [4] 4 × 16 based on ResNet-50, while having about 10× and 9× fewer model parameters and FLOPs, respectively. Compared to ip-CSN-152 [28], an efficient 3D CNN, VoV3D-L shows slightly lower top-1 accuracy but achieves higher top-5 accuracy with much less model capacity. The top-5 accuracy of VoV3D-L is also comparable to that of X3D-L, which uses a larger spatial scale (i.e., [356, 446]). If the proposed method were trained at the larger spatial scale, we would expect comparable top-1 performance to X3D-L as well; however, to focus on temporal modeling, we invested the computational budget in more input frames rather than a larger spatial shape.

Ablation study
We conduct ablation studies for the proposed components of VoV3D on Something-Something V1, because the Something-Something dataset requires more temporal modeling ability than Kinetics-400. Specifically, we first validate whether Temporal One-Shot Aggregation (T-OSA) is effective for temporal modeling. Next, we evaluate the efficiency of the depthwise spatiotemporal factorization module, D(2+1)D.
Temporal One-Shot Aggregation (T-OSA). Without its T-OSA modules, VoV3D is similar to X3D [3], so we use X3D as the comparison target. As can be seen in Table 5, we compare our method to X3D-M/L in terms of the number of parameters, FLOPs, and accuracy, both with and without the proposed depthwise spatiotemporal factorization module (i.e., D(2+1)D) incorporated. VoV3D-M/L consistently outperform X3D-M/L. In particular, the accuracy gap (1.9% / 0.9%) of the large models (VoV3D-L vs. X3D-L) is bigger than that (0.5% / 0.6%) of the medium models (VoV3D-M vs. X3D-M). This shows that the proposed T-OSA is effective for long-term temporal modeling on the Something-Something dataset. In addition, by comparing the accuracies of the deeper model (i.e., X3D-L) and VoV3D, we observe that accumulating more T-OSAs boosts the temporal modeling effect. Furthermore, when the depthwise factorization module (i.e., D(2+1)D) is incorporated, the performance difference between X3D and VoV3D remains consistent.
As explained in Sec. 3.3, we alternatively plug the bottleneck architectures shown in Fig. 2 into the T-OSA. Although R(2+1)D [29] reduces both parameters and GFLOPs from the standard 3D convolution [7], the depthwise bottleneck [3,28] in Fig. 2(c) reduces the computation far more significantly (about 20×). From this, we test further by decomposing the depthwise bottleneck into temporal and spatial depthwise convolutions, i.e., D(1+2)D and D(2+1)D. As shown in Table 6, both D(1+2)D and D(2+1)D outperform the other state-of-the-art methods while reducing model capacity. In particular, D(2+1)D yields a larger accuracy gain than D(1+2)D with fewer FLOPs. We conjecture that the preceding spatial convolution gives the input features of the temporal convolution a bigger spatial receptive field. In addition, we could not confirm the effectiveness of the non-linearity between the temporal and spatial convolutions claimed in R(2+1)D [29]. When we add ReLU or BN-ReLU into D(2+1)D, the results show worse accuracy, as shown in the sixth and seventh rows of Table 6.

Conclusion
We have proposed an efficient and effective temporal modeling 3D architecture, called VoV3D, that consists of Temporal One-Shot Aggregation (T-OSA) and a depthwise spatiotemporal factorized module, D(2+1)D. The T-OSA effectively models various visual tempos by aggregating features with different temporal receptive fields. The D(2+1)D module decomposes a 3D depthwise convolution into spatial and temporal depthwise convolutions, which makes the proposed VoV3D significantly more lightweight and efficient while improving accuracy. Thanks to T-OSA and D(2+1)D, our VoV3D outperforms state-of-the-art efficient 2D CNN as well as 3D CNN methods for temporal modeling. We hope that it can serve as an efficient baseline for video action recognition.

Acknowledgments
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual Big-Data Discovery Platform for Large-Scale Realtime Data Analysis, and No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network).

Table 1 :
Comparison of parameters and computation. This table considers only the 3D convolution located in the middle of the bottleneck in Fig. 2. t, k, and s denote the temporal kernel size, spatial kernel size, and stride, respectively. C, H, W, and T denote the channels, height, width, and number of frames of the input 3D feature map, assuming the input and output channel sizes are the same.

Table 3 :
Comparison with the state-of-the-art architectures on the Something-Something V1 & V2 val sets. The symbol † denotes our implementation. Note that the X3D models are trained by us with the same training protocols as VoV3D, based on PySlowFast [2].

Table 5 :
Validation of T-OSA compared with X3D on Something-Something V1. Both X3D and VoV3D are trained with the same training protocols based on PySlowFast [2].

We speculate that the non-linearity interferes with the connection between the spatial and temporal depthwise convolutions, which have not yet performed channel interaction.

Table 6 :
Validation of D(2+1)D compared to other bottleneck modules on Something-Something V1. We plug different bottleneck modules into the VoV3D-M model. The results in the 6th and 7th rows show the influence of the non-linearity between the spatial and temporal convolutions.