Multi-Label Multi-Class Action Recognition With Deep Spatio-Temporal Layers Based on Temporal Gaussian Mixtures

Current action recognition studies enjoy the benefits of two neural network branches, spatial and temporal. This work aims to extend the previous work by introducing a fusion of spatial and temporal branches to provide superior action recognition capability toward multi-label multi-class classification problems. In this paper, we propose three fusion models with different fusion strategies. We first build several efficient temporal Gaussian mixture (TGM) layers to form spatial and temporal branches to learn a set of features. In addition to these branches, we introduce a new deep spatio-temporal branch consisting of a series of TGM layers to learn the features that emerged from the existing branches. Each branch produces a temporal-aware feature that assists the model in understanding the underlying action in a video. To verify the performance of our proposed models, we performed extensive experiments using the well-known MultiTHUMOS benchmarking dataset. The results demonstrate the importance of our proposed deep fusion mechanism, contributing to the overall score while keeping the number of parameters small.


I. INTRODUCTION
Action recognition is an active research topic, owing both to the challenges that researchers must overcome in this field and to the importance of action recognition applications in daily life. Such applications include criminal detection, more timely alert systems for natural disasters, video summarization, and video recommendation systems. One of the challenges in classifying an action in a video is to accurately capture the hidden temporal relation between frames in addition to capturing the spatial semantics. To take both aspects into account, the filters of the convolutional layer must operate on 3-dimensional (3D) data, in contrast with the more general approach for image classification that uses only 2-dimensional (2D) convolution.
The use of 3D convolution has been shown to be effective in capturing both temporal and spatial dimensions. Although a video consists of spatial and temporal aspects, there have been several attempts to understand the variations in actions through the use of 2D convolution without sacrificing accuracy [1], [2]. Compared with 3D convolution, such 2D convolution trains faster and reduces inference time.
The associate editor coordinating the review of this manuscript and approving it for publication was Xianye Ben.
A different approach uses optical flow fields as an additional input modality, while keeping the RGB images as the main input. This modality focuses only on pixel displacement over time, thereby removing unnecessary spatial information contained in frames. Optical flow is estimated from pairs of consecutive frames, hence providing useful information regarding the motion.
Because more than one modality is used, several studies have used at least two separate architectures to accommodate the different modalities and learn the spatial and temporal information separately. Optical flow components are extracted from the RGB images, and both are then fed into separate architectures, which are referred to as ''branches'' in this paper. Each branch is expected to be complementary to the other branches. Previous works based on this approach [3]-[5] have obtained impressive results. Other architectures use a recurrent model to understand data sequences. Considering the frames in a video as a sequence of items of information, this technique can be regarded as aiming to capture the temporal relations contained in the frames.
All of the methods mentioned above are suitable only for capturing a fixed length of video data, i.e., regardless of the length of the video, frames are sampled to determine the action category. A study by Piergiovanni and Ryoo [6] addresses this drawback by learning temporal structure with a custom convolutional layer that uses notably fewer parameters. Their work successfully implements a temporal Gaussian mixture (TGM) layer on top of two-stream inflated 3D ConvNets (I3D) [7] and InceptionV3 [8] to reveal the temporal relations in videos. Specifically, [6] employs a set of Gaussian-controlled filters/kernels responsible for detecting frames to which the model should pay more attention. In addition to its ability to capture the temporal structure of an activity, their system also features a kernel that is independent of the duration of a video due to its fully convolutional design. This design also provides the ability to handle very long videos. Moreover, they conduct experiments using the favored ''two-stream'' configuration to classify actions in a video.
One may ask whether RGB and optical flow feature representations can be mixed and the resulting representations then processed further using a TGM layer. Previous work [6], [7] appears to use only two streams, neglecting the potential contribution of mixing these two representations. The question is a natural one because several research efforts reported in the literature, as described later, have demonstrated the significance of mingled modalities.
Based on this intuition, we propose three new fusion models that improve considerably on the previous work. We utilize different fusion mechanisms between the different branches of the TGM layers to enrich a model with a refined feature set. Concretely, we propose combinations of spatial and temporal feature sets on a temporal branch and on a new spatio-temporal branch at different levels. The main contributions of our work are summarized as follows:
1) A series of TGMs as a deep spatio-temporal branch. We propose a deep spatio-temporal branch composed of several TGM layers to boost the accuracy on the multi-label, multi-class problem for a given action video.
2) Unified multi-branch action recognition architecture. We combine the Gaussian-based spatial and temporal branches with a deep spatio-temporal branch in a unified architecture. These branches are named based on their input modalities.
3) Custom TGM layer on a temporal branch. We replace the 1 × 1 convolution with the maximum function inside the TGM layer, i.e., taking the maximum value on the input-channel axis, in order to make the TGM layer of the temporal branch more aware of the distinctive action features. This replacement results in finer action features for the subsequent layer. We confirm that this approach is superior to that of the original work.
The remainder of this paper is organized as follows. Section II describes related studies and explains their principal concepts. In Section III, we introduce our proposed TGM-based multi-branch network. Section IV includes the details of our experiments and evaluation and describes the comparison with state-of-the-art methods. It also includes an ablation study to determine the optimum result. Section V provides concluding remarks regarding our work and points out future directions.

II. RELATED WORK
We describe some work related to hand-engineered features and deep neural networks for action recognition.
An action can be described as a sequence of primitive movements [9], e.g., kicking a ball and closing something. Such an action sometimes consists of several simple actions, e.g., cooking involves taking ingredients and pouring them into a pan. Prior to the development of deep neural networks, several traditional algorithms existed to quantify inputs for action recognition in a video. Those algorithms use two modalities to produce a quantified vector. The RGB images are known to contribute significantly to action recognition systems. The study in [10] gives a thorough account of the progress in action recognition based on the use of RGB data. Other classic techniques were developed by Bobick and Davis [11], who created motion templates using methods called motion energy image (MEI) and motion history image (MHI). Then, a work by Klaeser et al. [12] introduced a 3D version of histogram of gradients (3DHOG), an extended version of 2DHOG, to project human action in a space-time dimension. Another work by Scovanner et al. [13] extended the scale invariant feature transform to 3DSIFT and used a bag of words to represent videos. Schuldt et al. [14] used a support vector machine (SVM) to classify space-time 3D descriptors into action categories. Ben et al. [15], [16] introduced a technique called coupled patch alignment and a general tensor representation framework for gait recognition. In addition, an optical flow field plays an essential part in action recognition to increase the robustness of a system, as can be seen in [17]-[19]. Generally, handcrafted spatial and temporal video features have been used to describe the spatial and temporal aspects that classic machine learning then uses to determine an underlying action.
Since convolutional neural networks have emerged to outperform existing traditional machine learning techniques, researchers have competed by using them to build very deep neural networks. In one preliminary attempt, researchers used 2D convolution to learn an action frame by frame, as described in [20], [21]. This 2D convolution operated on one frame of a video at a time. It was found that convolution in 2D space still performs well on some datasets although it does not model temporal patterns. Since video naturally contains motion, 3D convolution has been proposed in recent studies and is expected to show improved performance because it learns spatial and temporal dependencies simultaneously. The work in [22]-[24] utilized a 3D convolution kernel to include motion cues. Other work that uses 3D convolution includes a 3D version of Inception-V1 by Carreira and Zisserman [7] and 3D ResNet by Hara et al. [25]. The achievements of image classification inspire such work; the researchers inflated all of the 2D filters and the pooling windows to enable the network to accept a stack of frames. In other words, they replaced all N × N filters with a cubic version, N × N × N. Many researchers have been inspired by 3D ResNet and have adopted it in their studies [26], [27]. Several studies have also investigated the use of a recurrent network, such as long short-term memory (LSTM) or bi-directional long short-term memory (Bi-LSTM), to classify an action, as described in [28]-[31]. These authors argued that implementing a recurrent network on top of a convolutional neural network (CNN) backbone enables the capture of the sequential information of a video.
VOLUME 8, 2020
FIGURE 1. Overview of our best proposed model. For each branch, we add 3 layers of TGM to learn complex temporal structures. On top of each branch, we reduce the output dimension (C × D × T → D × T) and adjust the feature map dimension (D) to be three times lower than the original dimension by using the max function and 1D convolution, respectively. The outputs from the three branches are then concatenated on the dimension axis and passed to another 1D convolution to map from D size (i.e., 1,024) to N size (the number of classes, i.e., 65). Last, the sigmoid function is used to obtain the prediction for each time step. For simplicity, we omit a shortcut connection, dropout, and ReLU layer.
Insightful and innovative multi-stream methods have also been reported [4], [6], [32], [33]. As mentioned earlier, each stream conveys a different type of data. The types of data can be partitioned into several categories such as RGB, optical flow, RGB difference, and audio. Several fusion mechanisms have been demonstrated. Chi et al. [34] adopted a self-attention mechanism for which the input sources are RGB and optical flow. Feichtenhofer et al. [3] investigated several techniques for fusing different modalities at specific layers and proposed a fusion scheme containing two fusion strategies, namely, spatial and temporal. In the current work, we consider only spatial fusion, leaving temporal fusion for future research. Among spatial fusion strategies (Sum, Max, Concatenation and Conv fusion) on the UCF101 benchmarking dataset [35], these authors reported that the Sum fusion outperforms other functions except for Conv fusion. Of course, learning randomly initialized weights when using the Conv fusion consumes more training time compared to Sum fusion because the latter operates on two feature maps by simply summing them. We use the Sum fusion function in our work because it is simple and yet quite robust.
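The parameter-free spatial fusion functions named above (Sum, Max, and Concatenation) can be sketched as follows. This is a minimal illustration that assumes two feature maps of identical shape; the array names and shapes are hypothetical and are not taken from [3]. Conv fusion is omitted because it requires learned weights.

```python
import numpy as np

# Hypothetical feature maps from two streams, shaped (channels, time).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
flow = rng.standard_normal((4, 8))

sum_fused = rgb + flow                            # Sum fusion: element-wise addition
max_fused = np.maximum(rgb, flow)                 # Max fusion: element-wise maximum
cat_fused = np.concatenate([rgb, flow], axis=0)   # Concatenation on the channel axis

print(sum_fused.shape, max_fused.shape, cat_fused.shape)
```

Sum and Max keep the input shape, whereas Concatenation doubles the channel axis, so any layer that follows it must accept the larger input.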

III. OUR PROPOSED MULTI GAUSSIAN-BASED BRANCH

A. MULTI TGM LAYERS
We introduce our new architecture that consists of several Gaussian kernel-based branches. The overall structure of our best model is shown in Figure 1. As seen in the figure, our model uses two modalities, RGB data and optical flow. Thus, we have two branches based on their input types, namely, spatial and temporal, to classify the actions. Given the outputs from the base CNN, represented as F ∈ R^(T×1×1×D), we first propagate F to the spatial and temporal branches. T and D are the temporal length and the feature map dimension, respectively. We set the number of TGM layers to three for each branch, identical to that of the referenced work [6]. Inside the TGM layer, we learn a set of Gaussian mixture kernels. For each input channel j ∈ [1, C_in] and each output channel i ∈ [1, C_out], the associated filters convolve over the temporal input feature x and map the resulting channels into a single channel to produce s_i (Equation (1)), where K = [K_1, K_2, ..., K_C_out] denotes the Gaussian mixture kernels corresponding to each input and output channel and w_i is a 2D convolution with a 1 × 1 kernel size and one output channel, followed by the rectified linear unit (ReLU) activation function. We include more details of the TGM kernel/layer in the subsequent section. Notation for the channel-wise operation and the ReLU activation function is ignored for simplicity.
FIGURE 2. The output (C_out × D × T) from the earlier TGM layers of the upper branch is fed not only to the subsequent layer but also to the temporal branch. A simple addition operation is performed in the temporal branch to guarantee one-way information sharing. An explanation for the remaining parts of the diagram is identical to that of Figure 1.
FIGURE 3. In contrast to Figure 2, a middle branch is created to learn representations simultaneously using blended features rather than passing the representation from one branch to another existing branch. This new branch is named the ''spatio-temporal'' branch. Notice that the addition operation only occurs at the beginning of the spatio-temporal branch. With only a few extra parameters, this configuration surpasses Proposed #1 (see Table 1 for this comparison). The remaining parts of the diagram are identical to that of Figure 1.
The obtained s_i is then stacked on the channel axis to produce S_rgb = [s_1, s_2, ..., s_C_out] with a dimensionality of C_out × D × T, where C_out is the number of output channels. C_in and C_out can be considered as hyperparameters. We set C_out to four in this work. The value of D is consistent throughout all of the TGM layers (i.e., 1,024). Going beyond the previous work, we argue that the maximum function accentuates the distinctive, important aspect of temporal features. Thus, for the temporal branch (S_flow), we replace the 1 × 1 convolution + ReLU of Equation (1) with an aggregate function: in Equation (2), w is replaced with the max function operating on the channel axis. The output s_i is then appended along the channel axis to obtain S_flow, which is a C_out × D × T representation, identical to that of S_rgb. Then, S_rgb and S_flow are fed forward to the next layer. Figure 7 illustrates the overall process inside the TGM layer in a single branch. In addition, the output of the base model CNN and the last TGM layer are concatenated (see illustration in Figure 4).
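The two ways of combining input channels described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: one temporal kernel per output channel is shared across input channels, the 1 × 1 convolution is represented by a plain weight matrix `w`, the kernels are random placeholders rather than Gaussian mixtures, and D is reduced to 16 to keep the example small. The names `temporal_conv` and `tgm_combine` are hypothetical.

```python
import numpy as np

def temporal_conv(x, kernel):
    """Convolve every length-T row of x with a 1D kernel, keeping length T."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), -1, x)

def tgm_combine(x, kernels, w=None):
    """x: (C_in, D, T); kernels: (C_out, L); w: (C_out, C_in) or None.

    With w given, input channels are mixed by a weighted sum followed by ReLU
    (standing in for the 1x1 convolution). With w=None, the maximum over the
    input-channel axis is taken instead.
    """
    outputs = []
    for i, k in enumerate(kernels):
        y = temporal_conv(x, k)                        # (C_in, D, T)
        if w is None:
            s_i = y.max(axis=0)                        # max over input channels
        else:
            s_i = np.maximum(np.tensordot(w[i], y, axes=1), 0.0)
        outputs.append(s_i)                            # each s_i is (D, T)
    return np.stack(outputs)                           # (C_out, D, T)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 20))                   # C_in=4, D=16, T=20
kernels = np.abs(rng.standard_normal((4, 5)))
kernels /= kernels.sum(axis=1, keepdims=True)          # placeholder kernels
w = rng.standard_normal((4, 4))

S_rgb = tgm_combine(x, kernels, w)                     # spatial-branch variant
S_flow = tgm_combine(x, kernels)                       # temporal-branch variant
print(S_rgb.shape, S_flow.shape)
```

Both variants return a C_out × D × T tensor, which is why the two branches can later be added element-wise; the max variant has no weights to learn for the channel combination.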

B. ROLE OF TGM KERNEL
The Gaussian mixture kernel used in this paper was introduced in [6]. In essence, this kernel is a constrained kernel governed by the parameters of the Gaussians, i.e., a center µ and a width σ whose values are constrained to the positive range. As observed in Figure 5, this layer also includes an attention mechanism of the kind widely used in computer vision and language processing. A soft attention is applied to each Gaussian distribution to enable the layer to focus on the relevant parts of a temporal sequence.
We would like to highlight the implementation of the TGM layer in our proposed spatio-temporal branch. In the original paper, the authors apply the TGM layer only to the RGB and optical flow modalities separately, whereas in our implementation the TGM layer is also applied to mixed modalities.
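A kernel of this kind can be sketched as follows. The parameterization is an assumption made for illustration (tanh to place the centers inside the kernel, exp to keep the widths positive, and a softmax for the soft attention); the exact constraints used in [6] may differ, and the function name `tgm_kernels` is hypothetical.

```python
import numpy as np

def tgm_kernels(mu_hat, sigma_hat, attn_logits, length):
    """Build C_out temporal kernels as soft-attention mixtures of M Gaussians.

    mu_hat, sigma_hat: (M,) unconstrained parameters.
    attn_logits: (C_out, M) unconstrained attention logits.
    """
    # Constrain the centers to [0, length - 1] and the widths to be positive.
    mu = (length - 1) * (np.tanh(mu_hat) + 1.0) / 2.0
    sigma = np.exp(sigma_hat)
    t = np.arange(length)
    gauss = np.exp(-((t[None, :] - mu[:, None]) ** 2)
                   / (2.0 * sigma[:, None] ** 2))
    gauss /= gauss.sum(axis=1, keepdims=True)      # each Gaussian sums to 1
    # Softmax over the M Gaussians gives the soft-attention weights.
    a = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ gauss                               # (C_out, length)

K = tgm_kernels(mu_hat=np.array([-1.0, 0.0, 1.0]),
                sigma_hat=np.zeros(3),
                attn_logits=np.zeros((4, 3)),
                length=5)
print(K.shape)
```

Under these assumptions the number of learnable values depends only on the number of Gaussians and output channels, not on the kernel length, which is consistent with the small parameter count the paper attributes to the TGM layer.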

C. OUR PROPOSED FUSION MODELS
We propose three fusion mechanisms to enhance the performance of the model.

1) Spatial and temporal fusion model (Proposed #1)
As observed from Figure 2, a fusion is introduced between the TGM layers. This type of combination is similar to that in the work by Feichtenhofer et al. [36]. Two lateral connections are established from the spatial branch to the temporal branch in order to merge a meaningful representation of the RGB data. In contrast to their work, where the outputs are transformed prior to fusing, we merely add the RGB and optical flow features because the shape and length of those two types of features already match. Given S_rgb and S_flow, the outputs of the previous TGM layers in each branch, a new S_flow is obtained by adding S_rgb to S_flow at each index i where fusion occurs.
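A plausible form of this update, assuming plain element-wise addition at each fused layer as described above, is given below. The superscript notation for the layer index i and the hat on the updated feature are assumptions introduced here for illustration.

```latex
\hat{S}^{(i)}_{\mathrm{flow}} = S^{(i)}_{\mathrm{flow}} + S^{(i)}_{\mathrm{rgb}}
```

Because both terms have the shape C_out × D × T, the addition needs no projection and adds no parameters.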

2) Early spatio-temporal fusion model (Proposed #2)
In this approach, we construct a new pathway from the existing branches. We introduce a spatio-temporal branch with a fusion occurring at the beginning of the branch (see Figure 3). We want our model to learn not only spatial and temporal information separately but also spatio-temporal information simultaneously. We are confident that the model will benefit from this fusion. Formally, a new spatio-temporal branch (S_st) results from the element-wise addition of S_rgb and S_flow.

3) Multi-level spatio-temporal fusion model (Proposed #3)
In contrast to Proposed #2, we carry out a fusion strategy at several levels. We argue that each level of the TGM layers produces different temporal activity patterns. As described in Equation (5), given features S_rgb and S_flow, we perform element-wise addition at different levels to form the spatio-temporal branch, denoted as S_st.
FIGURE 6. Some videos of the MultiTHUMOS dataset. Each video may contain multiple activities; hence, the task is categorized as a multi-label multi-class classification problem. The image is taken from [37].
We assign the term ''spatio-temporal'' to our new branch since this branch operates on mingled modalities, namely, the RGB and optical flow components. The term ''spatio-temporal'' is also used to describe the role of 3D ConvNets; i.e., a 3D kernel convolves not only on the surface of the feature maps but also on the depth of the temporal axis of the feature maps.
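The difference between early fusion (Proposed #2) and multi-level fusion (Proposed #3) can be sketched as follows. The wiring is an assumption made for illustration: at each fused level the current spatial and temporal features are added into the spatio-temporal branch before all three branches pass through a layer. The function `layer` is a hypothetical stand-in (a fixed temporal smoothing) and not a TGM layer; in the actual models each branch has its own learned TGM layers.

```python
import numpy as np

def layer(x):
    """Hypothetical stand-in for a TGM layer: fixed 3-tap temporal smoothing."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), -1, x)

def run_branches(f_rgb, f_flow, fuse_levels, depth=3):
    """Run spatial, temporal, and spatio-temporal branches of `depth` layers.

    fuse_levels: set of 0-based levels at which S_rgb + S_flow is added
    element-wise into the spatio-temporal branch.
    """
    s_rgb, s_flow = f_rgb, f_flow
    s_st = np.zeros_like(f_rgb)
    for level in range(depth):
        if level in fuse_levels:
            s_st = s_st + s_rgb + s_flow           # element-wise fusion
        s_rgb, s_flow, s_st = layer(s_rgb), layer(s_flow), layer(s_st)
    return s_rgb, s_flow, s_st

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((4, 16, 20))           # C_out x D x T, D reduced
f_flow = rng.standard_normal((4, 16, 20))

_, _, st_early = run_branches(f_rgb, f_flow, fuse_levels={0})        # Proposed #2
_, _, st_multi = run_branches(f_rgb, f_flow, fuse_levels={0, 1, 2})  # Proposed #3
print(st_early.shape, st_multi.shape)
```

In both cases the fusion itself adds no parameters; the extra cost of the new branch comes only from its own layers.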

IV. EXPERIMENT
In this section, we describe our experiment in detail. We demonstrate the implementation of the TGM layer in each branch and simultaneously fuse the spatial and temporal branches to achieve a positive result. Moreover, we conduct some ablation studies to emphasize the benefit of the Max function over 1 × 1 convolution inside the TGM layer and to investigate the benefits of the weighted branch scheme.

A. DATASET
We conducted experiments on the untrimmed, multi-label MultiTHUMOS dataset [37] to investigate the effectiveness of the spatio-temporal branch. This dataset is an extension of the well-known THUMOS dataset, with its number of action classes extended from 20 classes to 65 classes. Verbs are taken from the original THUMOS with several additions of various activities.
FIGURE 7. Inside the TGM layer. A Gaussian weighted kernel with length L is multiplied by each input channel C_in to produce a tensor with the shape of C_in × D × T. We apply a slight alteration to this layer: we substitute the 1 × 1 2-dimensional convolution (indicated by the purple box) with the max function to combine the input channels. We note that the figure illustrated above, which was taken from [6], represents the original TGM layer.
In total, the dataset comprises 400 videos spanning 30 hours. It contains 38,690 annotations, with every frame having 1.5 labels on average, notably higher than THUMOS (0.3 per frame). Each video has 10.5 action classes on average, a sharp increase from the 1.1 of THUMOS. Examples of the videos are illustrated in Figure 6.

B. FEATURE EXTRACTION
Prior to video feature extraction, we first extract all of the RGB frames and the optical flow of each video. We crop all of the extracted frames to a window size of 224 × 224 at the center to match the base model's input dimensions. We also normalize all pixels to values in the range [−1, 1]. In addition, we use the well-known TVL1 algorithm [38] to compute the optical flow. The next step is to extract the video features from the image data. Prior to extracting, we load our model with the weights pre-trained on the ImageNet and Kinetics datasets. This is a common transfer learning technique. We forward propagate our collection of frames through the I3D network to obtain the activations. We select the last average pooling layer (AvgPool3d) of I3D as the endpoint for feature extraction, as in the referenced work. The extracted shape of each video is T × 1 × 1 × D. The value of T varies with the length of a video, whereas D is consistent for all videos (i.e., 1,024). We note that the TGM layer can accept arbitrary temporal lengths. The output features are then formatted as NumPy arrays and saved to disk. If the number of frames is excessively high (i.e., a video has a long duration), then we split it into chunks using a threshold (set to 100 in our case) and forward propagate one chunk at a time through the I3D network. The value of 1,024 represents the number of channels. The term ''channel'' here can be ambiguous. We typically use this word to describe the number of feature maps, whereas ''channel'' in the TGM layer, represented by C_in and C_out, refers to the number of Gaussian mixtures.
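The preprocessing steps above (center crop, normalization to [−1, 1], and chunking of long videos) can be sketched as follows. This is a minimal illustration that assumes uint8 frames stored as a (T, H, W, C) array and that the threshold is the maximum chunk length; the helper names are hypothetical, and the I3D forward pass itself is not shown.

```python
import numpy as np

def center_crop(frames, size=224):
    """frames: (T, H, W, C) -> (T, size, size, C), cropped around the center."""
    _, h, w, _ = frames.shape
    top, left = (h - size) // 2, (w - size) // 2
    return frames[:, top:top + size, left:left + size, :]

def normalize(frames):
    """Map uint8 pixel values in [0, 255] to floats in [-1, 1]."""
    return frames.astype(np.float32) / 127.5 - 1.0

def chunks(frames, max_len=100):
    """Yield consecutive chunks of at most max_len frames."""
    for start in range(0, len(frames), max_len):
        yield frames[start:start + max_len]

# Small synthetic video; max_len=5 is used here only to keep the example light.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(12, 240, 320, 3), dtype=np.uint8)
batches = [normalize(center_crop(c)) for c in chunks(video, max_len=5)]
print([b.shape for b in batches])
```

Each prepared chunk would then be passed to the feature extractor, and the per-chunk features concatenated along the temporal axis before being saved.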

C. TRAINING AND TESTING
We conduct all experiments using the PyTorch framework. The Adam optimizer [39] is applied to optimize the model's parameters. We choose 5e−3 as our starting learning rate and decrease it if the loss becomes saturated. Training ends after 60 epochs. We minimize the binary cross-entropy loss, in which L is the loss score, y_i is the true label for a specific class occurring at a specific time, and p(y_i) is our model's prediction. To measure the performance, we use the mean average precision (mAP). According to Equation (7), the mAP is calculated by summing each AP (AP_c) and then dividing by N (the number of queries), whereas the AP for a specific class is computed using the precision at each relevant position n, where M is the total number of actions predicted.
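The loss and the metric can be sketched as follows. This is an illustration using the standard textbook definitions, which is an assumption: the loss is averaged over all entries, and the AP is normalized by the number of positive items for the class, which may differ from the normalization by M used in the paper. The function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy between labels in {0, 1} and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p)
                          + (1.0 - y_true) * np.log(1.0 - p)))

def average_precision(y_true, scores):
    """AP for one class: mean of precision@n over ranks n holding a positive."""
    order = np.argsort(-scores)                    # rank by descending score
    hits = y_true[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_n = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_n * hits).sum() / hits.sum())

def mean_average_precision(Y_true, Scores):
    """Y_true, Scores: (num_frames, num_classes). Mean of the per-class APs."""
    aps = [average_precision(Y_true[:, c], Scores[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.mean(aps))

# Four frames, two classes (hypothetical labels and sigmoid outputs).
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
P = np.array([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]])
loss = binary_cross_entropy(Y, P)
m_ap = mean_average_precision(Y, P)
print(round(loss, 4), round(m_ap, 4))
```

Because each frame can carry several labels, the loss treats every class as an independent binary decision, which matches the per-time-step sigmoid output of the model.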

D. RESULTS
Our aim is to improve on the previous work [6] by implementing a variety of fusion schemes to increase the mAP. We postulate that a fusing mechanism is important for accuracy. In this section, we compare our three proposed models with the baseline and state-of-the-art methods. We evaluate three types of fusion and conduct short analyses based on the results. An examination of the results presented in Table 1 shows that by integrating the information of the spatial branch into the temporal branch, the improvement of Proposed #1 over the baseline is marginal (0.9% higher). This nevertheless indicates an advantage of the one-way information sharing between the two branches. In the next approach, the model Proposed #2 benefits from the newly created branch, with a 1.5% improvement over the baseline. This indicates that inserting a new branch and simultaneously optimizing the weights of all of the branches improves the performance to some extent. Our last and best model, Proposed #3, outperforms the baseline method by 2.9%. We believe that this model captures more complex, nonlinear temporal features, and thus contributes significantly to the performance of the model.
We can conclude that a fusion mechanism yields a consistent improvement over the baseline model and other existing models on the MultiTHUMOS dataset.

E. ABLATION STUDY
This section describes experimental studies on MultiTHUMOS. We conducted several comparisons to identify the advantageous strategies. We experiment with various channel combination strategies. We also conduct ablation experiments on a weighted scheme for the spatial and temporal branches.

1) CUSTOM CHANNEL UNIFICATION
Referring to Figure 7, we observe that the original version applies a 2D convolution to combine temporal reasoning features on the channel axis (see Equation (1)). This 1 × 1 2D convolution is designed to combine all of the C_in channels into one channel.
To the best of our knowledge, there are simpler, less expensive yet effective techniques for achieving channel combination: summation, average, and maximum. These functions have no parameters to optimize and thus have a lower computational cost. We performed several experiments to compare the effectiveness of each function with that of the original work. The results are presented in Table 2. We observed that the 1 × 1 convolution is more beneficial to the spatial branch and, conversely, is not useful for the temporal branch. Surprisingly, the TGM layer with the maximum function performs better on the temporal branch. We hypothesize that this function yields a slight improvement because each pixel in the temporal branch corresponds to small changes; thus, the maximum function ensures that the model takes only the distinctive pieces of representation in each channel. Consequently, the subsequent layer processes more fine-grained temporal features.

2) WEIGHTED BRANCH
An examination of the data presented in Table 2 shows that each branch makes a different contribution to the performance. Thus, it is natural to ask whether the use of a weighted scheme per branch can lead to an increase in model performance. We performed several ablation experiments to determine the optimal combination of weights per branch. We weight each input of the first TGM layer, in both the spatial and temporal branches. We varied the weight values randomly and selected the best-performing combination. The results are described in Table 3. Even though the difference is marginal, the results confirm that each branch contributes to the performance unequally, in agreement with the results presented in Table 2.

F. ANALYSIS
In this section, we discuss how well the proposed models classify the frames into the predefined action classes. To accomplish this, we plot the predictions over the temporal region, with the X-axis being the time axis (see Figure 8). This figure enables us to determine the accuracy of each proposed model by comparing the nonblack regions (the proposed models) with the black region (the ground truth) in the video of a volleyball game. It is observed that our proposed models exhibit good performance in some activities. We plot several video frames to describe the activities of ''Run'' and ''VolleyballSpiking''.
FIGURE 8. We show video images that describe the two actions ''Run'' and ''VolleyballSpiking''. A red dot indicator is placed below a person performing a specific activity at time t. It is observed that interpolating semantic information between two branches is beneficial for improving the performance. Best viewed in color.
In the Run activity, surprisingly, our proposed models greatly improve the performance while the baseline model fails to predict the activity. Figure 8 shows that a person with a red dot is performing Run before jumping and hitting the ball. In the right-hand layout, our proposed models correctly predict this activity; the red temporal region stretches for some time whereas the baseline model misses the prediction.
We also show the advantage of fusing the TGM layers in VolleyballSpiking. While the baseline model fails to accomplish the prediction, our proposed models enjoy the benefits of the fusing mechanism that can locate and classify the spiking activity in video frames. Again, a man with a red dot indicator is carrying out a volleyball spike. At the same time, this man is also performing the Jump activity. Our models predict Jump activity with a higher confidence level compared to the base model: the temporal regions are drawn continuously for our proposed models.
Despite successful predictions, we also observed mispredictions. As shown for the Sit activity, none of the models can classify this activity even when it is actually present. We suspect that because no changes are occurring temporally (i.e., the object/person is motionless), our proposed models have difficulty in learning its temporal structure. Furthermore, it is possible that the object is too small for detection purposes and thus that our proposed models fail to notice it. We note that false-positive predictions also occurred, as was found for the VolleyballSpiking and VolleyballBlock activities. Early in these activities, all of the models predict a nonexistent action, in contrast with the predictions after some time t.
We also observe that all of the proposed models are accurate for the Fall, VolleyballSet, and NoHuman classes. Additionally, we are confident that the fusion of two modalities, regardless of how they are fused, performs well compared to the baseline. This suggests that it will be beneficial to explore other fusion techniques (e.g., temporal fusion, as mentioned earlier).
Furthermore, we examined the number of learnable parameters and compared it with other works. Table 4 indicates that three branches (spatial, temporal, and spatio-temporal) with three stacked TGM layers have notably fewer parameters than the other works; the TGM layer is therefore lightweight.

V. CONCLUSION
In this paper, we proposed new unified multi-branch neural network models consisting of a sequence of TGM layers that serve as spatial, temporal, and spatio-temporal branches for performing multi-label multi-class classification tasks. Outputs from the spatial and temporal branches are interpolated to form the spatio-temporal branch, with only a few learnable parameters added. We also demonstrated the benefit of using the maximum function inside the TGM layer to combine the input channels. The experimental results have shown that our proposed fusion strategies with the spatio-temporal model learn temporal structure effectively, ultimately improving the activity detection performance and outperforming the baseline and several other designs on the MultiTHUMOS dataset.
For future work, as discussed in Section II, we are interested in fusing the two branches temporally, as demonstrated in [3], where the authors claim that a temporal fusion strategy can achieve larger gains in model performance.