Behavior Recognition Based on Category Subspace in Crowded Videos

Crowd behavior refers to a collective behavior composed of two or more individuals who influence, interact with, and depend on each other for a specific goal. Compared with ordinary crowd behavior, dangerous crowd behavior occurs with much lower probability. Video-based crowd behavior recognition can be cast as a multi-label classification task characterized by complex scenes and imbalanced samples. To tackle the problems of imbalanced samples and multi-label classification, a classification method based on associative subspaces is proposed. For a single category with fewer samples (called the main category), this paper generates a special subspace in which these samples are relatively easy to distinguish through association with other categories. A classifier that weakens the main category and strengthens the relationship between the main category and other categories is designed in the subspace. With this classifier, recognition of the main category in the corresponding subspace depends less on the number of its samples. To make full use of the relevant information about categories, multi-label information is further injected into the spatio-temporal features of the video action representation. Experiments on the challenging WWW dataset show that both the proposed subspace method and the multi-label information fusion mechanism are effective.


I. INTRODUCTION
Crowded videos of the same category may contain different scenes, different numbers of people, and different fields of view, which makes the crowded video classification task very challenging [1]-[4]. In the real world, dangerous crowd behaviors rarely occur, which makes it difficult to collect sufficient video data. However, dangerous crowd behaviors (stampedes, riots, etc.) can easily cause huge losses of property and lives. Moreover, existing crowded video datasets are very poorly balanced [1], including numerous samples of general crowded videos but very few dangerous ones. Therefore, it is very important to investigate the classification of imbalanced crowded videos.
Crowded videos often contain multiple events and behaviors, which makes video-based crowd behavior recognition a multi-label classification task. Due to the emergence of the large crowd behavior video dataset (WWW), a series of multi-label crowd behavior recognition algorithms have been proposed [1]-[3]. Among these algorithms, Shao et al. [1] developed a multi-task deep model for joint learning that combines appearance and motion features for better crowd understanding. Besides, based on category dependencies, the algorithm also improved multi-label crowd recognition performance under the guidance of manually defined rules. However, the above method does not consider the imbalance of samples, which is disadvantageous to categories with fewer samples during training. To enable categories with fewer samples to perform better in classification, this paper introduces the idea of subspace. We construct subspaces using category association, which can effectively address the problem of imbalanced classification. The associative subspace principle is shown in Fig.1. To distinguish the main category, we expect to generate a suitable subspace in which it can be easily distinguished by using correlation information between the main category and the other categories. For each category with a small number of samples, this study generates a corresponding subspace so that the main category can better use the subordinate relationship between the categories in that subspace.
The associate editor coordinating the review of this manuscript and approving it for publication was Fan-Hsun Tseng.
For the task of imbalanced classification, we construct subspaces with category information so that categories with fewer samples can achieve satisfactory performance. Meanwhile, we design a classifier for categories with fewer samples. The classifier optimizes the current category, which differs from the globally optimized classifiers [3], [5], [6] that utilize the relationship between categories. Specifically, on the one hand, the classifier weakens the main category and reduces its dependence on the number of samples; on the other hand, it enhances the relationship between the main category and other categories for the indirect classification of the main category. In a nutshell, the classifier designed for categories with fewer samples optimizes the current subspace by weakening the main category and weighting the association relationship between categories.
Feature representation plays a pivotal role in visual classification tasks [2], [3]. At present, the mainstream video classification features are obtained through two streams (static and dynamic). Most existing video features [1]-[3], [7]-[9], however, are easily affected by appearance and motion noise because of the substantial differences between crowded scenes and the great variety of motion information. Thus, to tackle the recognition task in crowded scenes, we combine motion trend features with dynamic evolution features. First, multi-label information is integrated into 3D dynamic features by a Graph Convolution Network (GCN) [10] to capture the global motion trend. Then, guided by the motion trend, a Long Short-Term Memory (LSTM) network is used to collect important evolution features and erase dynamic and appearance noise.
The main contributions of this paper are: (1) the idea of subspace for categories with fewer samples; (2) a classifier based on subspace correlation, designed to address the problem of imbalanced samples during classification; (3) a feature representation that combines motion trend features with dynamic evolution features to enhance the description of the video change trend.

II. RELATED WORK
In this section, we introduce and discuss related work on multi-label classification in terms of subspace construction, subspace classifier design, and feature representation, respectively.
Subspace clustering seeks to partition the original space of a dataset into multiple subspaces, and clustering algorithms have been widely used for determining the subspaces [11]-[13]. Among the existing subspace clustering methods, spectral clustering methods [14]-[17] have become increasingly popular because they are easy to implement and more likely to converge to a global optimum than conventional clustering algorithms. However, a drawback remains: the mixed-signed result given by the eigenvalue decomposition of the Laplacian may degrade the clustering performance [18]. To address this problem, the work in [19] established an equivalence with spectral clustering and proposed two non-negative spectral clustering algorithms. Although spectral clustering has been successful in many applications, the relationship between the affinity matrix and the labels of the data is not fully exploited, so there is no guarantee of overall optimal performance. To overcome this challenge, a unified optimization framework was proposed that enforces the coherence and discrimination of the affinity matrix as well as the labels [20]. However, these clustering approaches are inherently unsupervised learning algorithms, which tend to ignore category information. The label information in the crowded scene appears in pairs, and category information plays an important role in the recognition process. Spectral clustering methods do not account for the association between categories and are therefore not suitable for multi-label crowd behavior recognition. In view of this, we utilize dependencies among categories to construct subspaces in this paper. For a given category, its subspace is generated on the basis of its association with other categories.
In this paper, crowd behavior recognition is treated as a multi-label classification task in which each object is represented by a single instance associated with multiple labels. Multi-label learning has been extensively studied during the past decades, and many algorithms have been proposed. For example, the simplest method is to decompose a multi-label task into a series of binary classification problems [21]. However, this method is essentially limited because it overlooks label correlations. This limitation has stimulated research on approaches that capture and exploit label correlations in various ways. Some approaches based on graph representation learning [6], [10] have been proposed to capture label correlations for multi-label recognition. Besides, multi-instance multi-label fast learning (MIMLfast) was proposed in [5] to utilize the relations among multiple labels. However, given the imbalanced sample distribution in our work, categories with a small number of samples obtain worse results in the classification process. In addition, the aforementioned methods optimize all categories globally, which is adverse to categories with fewer samples. Accordingly, we construct subspaces for these categories by utilizing category association. Meanwhile, a corresponding classifier is designed for each category subspace.
Feature representation is an important factor for classification. Traditional hand-crafted features [22], [23] are gradually being replaced by deep learning features. Recently, deep neural networks have been successfully applied to action recognition [24]-[26]. Previously, 2D convolutional neural networks [27], [28] trained on ImageNet [29] were usually exploited for RGB image classification. However, for video classification, appearance information is not enough, and dynamic feature representation plays a vital role in recognition [9], [30]. To model motion information, Simonyan et al. proposed a two-stream ConvNet architecture that incorporates spatial and temporal networks [8], where the temporal stream is trained to recognize actions from motion in the form of dense optical flow. Building on the two-stream architecture, the work in [7] made further improvements by introducing residual connections. Meanwhile, other works have tried to adapt existing techniques to action recognition in videos [31]-[35]. In a nutshell, obtaining an effective spatio-temporal feature representation is essential for action recognition. However, in our work, because even the same crowd behavior may occur in different scenes, the appearance noise is relatively large; meanwhile, crowd behavior is usually accompanied by a variety of motion information, resulting in relatively large dynamic noise. Existing methods for fusing appearance and dynamic features can be influenced by this noise. To effectively describe motion information in the video, we combine motion trend features with dynamic evolution features. Specifically, under the guidance of the motion trend, an LSTM is used to collect important evolution features and discard dynamic and appearance noise. Because optical flow has limited expressive power in complex motion scenes, we exploit a 3D convolution network [36] to obtain dynamic information.
For the task of multi-label recognition, category information can enhance semantic representation. Hence, we integrate the dependency relationship between categories into dynamic information to obtain a motion trend with semantic association. Then, dynamic features with strong semantic correlation are fused into frame-level static features. Besides, extensive research has shown that LSTM, a variant of the Recurrent Neural Network (RNN), has a strong ability to model long-term temporal dependencies in sequences. The LSTM network has also been used for video action recognition. In this paper, the characteristics of LSTM are used to filter out appearance and dynamic noise.

III. METHOD
The architecture of crowd behavior recognition is illustrated in Fig.2. It primarily involves three stages: construction of subspace, subspace classifier design, and video feature representation. We construct a subspace for each special category and then design the subspace classifier to optimize the category subspace. For the representation of video features, we combine motion trend features with dynamic evolution features. Coupled with the motion trend, LSTM serves as a tool for filtering out dynamic and appearance noise. In the following sections, we detail the process of subspace construction, subspace classifier design, and feature representation. To make this paper easy to follow, the symbols are summarized in Table 1.

A. CONSTRUCTION OF SUBSPACE
In the real world, the probability of some crowd behaviors is small, and it is difficult to obtain sufficient samples. The distribution of crowd video samples is thus extremely imbalanced. Categories with insufficient samples cannot achieve good classification results in training in spite of a strong feature representation. For instance, experiments [6] on the multi-label dataset VOC 2007 [37] show that the performance on categories with fewer samples (those whose number of samples is less than the training-set average; mean average precision (mAP) of 92%) is inferior to that on categories with more samples (those whose number of samples is larger than the training-set average; mAP of 95.1%). Similarly, the WWW dataset [1] is also imbalanced. According to [3], the mean area under the Receiver Operating Characteristic curve (mAUC) is 89.3% for categories with fewer samples and 92.4% for categories with more samples. To address the problem of imbalanced sample classification, the idea of subspace is proposed for crowd video categories with fewer samples.
The construction of a subspace can be divided into two steps. First, rules and conditions are applied to determine the categories with few samples according to the training set. Second, a screening mechanism is adopted to select a series of categories from all categories and construct a subspace for each small-sample category. The two steps are specified as follows.
To clearly describe the construction of the subspaces, we first define the relevant symbols. The symbols c and n denote the number of categories and the number of samples in the training set, respectively. {R_i} is the original category set, and R_i is the i-th category. {R^min_j} is the set of small-sample categories and contains e (e < c) categories. r_i is the number of samples of the i-th category in the entire training set, and r_ij is the number of samples in which the i-th category and the j-th small-sample category co-occur.
Categories with fewer samples are determined by formula (1), where the value of ε is set to 0.8. A subspace based on the j-th category in the small-sample set {R^min_j} is formalized by formula (2), where β_1 and β_2 are hyperparameters.
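The two construction steps can be illustrated with a minimal Python sketch. It does not reproduce the paper's selection formulas; it assumes two hypothetical rules chosen only for illustration: a category counts as small-sample when its count r_i falls below ε times the mean count over all categories, and a category joins the subspace of main category j when its co-occurrence ratio r_ij / r_j is at least β_1 (closely related) or at most β_2 (strongly distinct). The function names, the toy counts, and the threshold values of β_1 and β_2 are assumptions as well.

```python
from typing import Dict, List, Tuple


def small_sample_categories(counts: Dict[str, int], eps: float = 0.8) -> List[str]:
    """Assumed rule: a category is 'small' if its count is below eps * mean count."""
    mean_count = sum(counts.values()) / len(counts)
    return [name for name, r in counts.items() if r < eps * mean_count]


def build_subspace(
    main: str,
    counts: Dict[str, int],
    cooccurrence: Dict[Tuple[str, str], int],
    beta1: float,
    beta2: float,
) -> List[str]:
    """Assumed rule: keep categories that co-occur with the main category
    very often (ratio >= beta1) or very rarely (ratio <= beta2)."""
    r_main = counts[main]
    if r_main == 0:
        raise ValueError("the main category has no samples")
    members = [main]  # the main category is placed first
    for other in counts:
        if other == main:
            continue
        ratio = cooccurrence.get((other, main), 0) / r_main
        if ratio >= beta1 or ratio <= beta2:
            members.append(other)
    return members


# Toy training-set statistics (hypothetical numbers).
counts = {"outdoor": 700, "walk": 500, "stand": 400, "police": 40, "riot": 20}
cooccurrence = {
    ("outdoor", "riot"): 18,
    ("walk", "riot"): 10,
    ("stand", "riot"): 1,
    ("police", "riot"): 15,
}

small = small_sample_categories(counts, eps=0.8)
subspace = build_subspace("riot", counts, cooccurrence, beta1=0.7, beta2=0.1)
print(small)
print(subspace)
```

In this toy case the subspace of "riot" drops "walk", whose co-occurrence ratio is neither high nor low, so the subspace is smaller than the original category space.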

B. SUBSPACE CLASSIFIER
In this paper, since the crowd dataset WWW [1] is a multi-label dataset, we first utilize the sigmoid activation function to classify all categories. In the classification process, a probability value between 0 and 1 is assigned to each category, and the categories are independent of each other. Categories with relatively sufficient samples can achieve decent performance with the sigmoid activation function, whilst categories with fewer samples obtain worse results under the same circumstances. To overcome this problem, we design a special classifier (called the subspace classifier) for these categories by introducing the idea of subspace. The design process of the subspace classifier is shown in Fig.2. First, the main category (M_0) obtains its corresponding subspace through category association. The subspace includes categories that have a close relationship or a strong distinction with M_0. The association relationship between categories falls in the range [0, 1]; the larger the number is, the closer the relationship is. The main category (M_0) is then classified by the subspace classifier.
The subspace category association classifier is inspired by attribute assignment (AA) [3] and by [38]. AA harnesses the subordinate relationship between categories for multi-label crowd behavior classification. Suppose that X ∈ R^{d×n} represents a feature matrix comprising n training samples with d-dimensional feature vectors. Meanwhile, the mapping matrix U ∈ R^{d×c} bridges low-level features with categories, and G ∈ {0, 1}^{n×c} denotes the label matrix of the entire training set. According to [3], AA employs a convex optimization method with a closed-form solution (formula (3)), where ||•||_F^2 is the squared Frobenius norm (F-norm), and γ and λ are hyperparameters. A ∈ [0, 1]^{c×c} is the proposed dependency matrix, which captures the interrelationship among categories. According to [3], the above convex optimization problem has a closed-form solution in which I is the identity matrix. Given the closed-form solution, the optimization problem can be solved very efficiently. However, AA optimizes globally, which is not conducive to categories with few samples. To this end, we construct a subspace to distinguish categories with fewer samples and design a classifier based on the idea of subspace, called the subspace classifier. In our framework, each category with fewer samples is configured with a special subspace, which is different from the AA algorithm. For each category with few samples, the subspace classifier is used to optimize the corresponding subspace. In a subspace, the category to be distinguished is deemed the main category.
According to [3], [38], the optimization function of the subspace is defined in terms of the following quantities. U^(j) ∈ R^{d×c^(j)} is the mapping matrix of the main category (j) on the subspace, which establishes a bridge from features to category information. The matrix G^(j) ∈ R^{n×c^(j)} is the label matrix of main category (j) on the subspace. The symbol c^(j) represents the number of categories on the subspace for the main category (j). The symbol S^(j) ∈ R^{c^(j)×c^(j)} represents the correlation matrix of main category (j) on the subspace, W^(j) ∈ R^{c^(j)×c^(j)} is the weight matrix, and the remaining term is a regularization term. Next, we introduce the matrices S^(j) and W^(j) and the regularization term in turn. The elements of matrix S^(j) are defined according to [3]. To facilitate training, we reorder the categories so that the main category is the first column; correspondingly, the first element of the subspace (formula (2)) is the main category. The proportion of the main category is adjusted by the category correlation matrix S^(j) of the subspace. The matrix W^(j) weakens the main category and weights the association relationship between categories. The elements of W^(j) are given by formula (7), where the value of α ranges between 0 and 0.3 and the symbol e represents the number of subspaces. According to formula (7), the matrix W^(j) is expressed as formula (8), where each element on the main diagonal represents the proportion of each category on the subspace. The value of W^(j)_11 is α, indicating that the main category (j) is weakened because the value of α is less than 0.3. The other elements on the main diagonal are 1−α, implying that the proportion of the other categories is strengthened on the subspace. The relationship between the main category and other categories is established by setting the corresponding elements to α (the same value of α is used throughout the matrix W). Specifically, the element W^(j)_1k is α. An element value of 0 in the matrix indicates that there is no dependency between the two categories.
If W^(j) is an identity matrix, following formula (3), the classifier reduces to attribute assignment (AA). If not, according to formula (8), the classifier is the subspace classifier. In addition, the regularization term is defined according to [3], and U^(j) is formulated accordingly, where S^(j)_1 represents the first column of S^(j).
VOLUME 8, 2020

C. FEATURE REPRESENTATION
Effectively capturing distinctive spatio-temporal features to model the spatio-temporal evolution of different actions is crucial for video action recognition. As shown in Fig.3(a), in the traditional method [8], the spatial stream captures still frame-level features, whilst the temporal stream captures dynamic features in the form of dense optical flow. Finally, the class score is the average of the static score and the dynamic score. However, joining static and dynamic features together is easily affected by noise. To obtain a discriminative feature representation in crowded scenes, we combine motion trend features with dynamic evolution features rather than following the traditional method. The overall framework of our approach is shown in Fig.3(b); it is composed of two main modules: dynamic evolution features and motion trend features.

1) DYNAMIC EVOLUTION FEATURES
For the task of video action recognition, an LSTM network is selected to capture context information, as shown in Fig.2. The LSTM network is composed of two layers. In detail, the first 'lstm' layer outputs the state at each step of the video sequence, while the output of the second 'lstm' layer at the last time step is used for classification. The input of the LSTM network is the frame-by-frame fusion of the motion trend features and the static frame-level features, where X^S_t represents the static features of the t-th frame and X^M represents the motion trend features. The l static frames are uniformly sampled from the original frame sequence, as shown in Fig.3(b).
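Uniform selection of l static frames can be sketched as follows. The text does not say which frame of each interval is taken, so this sketch assumes one common choice: the sequence is divided into l equal segments and the frame at the centre of each segment is selected. The function name is hypothetical.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return num_samples frame indices spread evenly over the sequence.

    Assumption: the centre frame of each of num_samples equal segments is used.
    """
    if num_samples < 1 or num_frames < num_samples:
        raise ValueError("need 1 <= num_samples <= num_frames")
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


print(uniform_frame_indices(100, 4))
```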
Any basic CNN model [27], [28] can be used to capture the frame-level features of the video. In our experiments, following [1], [3], the ResNet50 model [28] pre-trained on ImageNet [29] is chosen as the backbone CNN. Therefore, if the input video frame X^I_t has a resolution of 224 × 224, we obtain a 7 × 7 × 2048 feature map from the last 'conv' layer. Then, we apply average pooling (AP) to obtain the m-dimensional static feature X^S_t, where η_cnn represents the CNN model and m = 2048.
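The pooling step can be illustrated in plain Python. The sketch assumes that AP denotes global average pooling over the two spatial dimensions, which is how a 7 × 7 × 2048 feature map becomes a 2048-dimensional vector; the feature map is represented as a nested list in height × width × channel order, and the function name is hypothetical.

```python
from typing import List


def global_average_pool(feature_map: List[List[List[float]]]) -> List[float]:
    """Average a height x width x channels feature map over its spatial positions."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for ch, value in enumerate(cell):
                pooled[ch] += value
    positions = height * width
    return [total / positions for total in pooled]


# Tiny 2 x 2 x 3 example; a real backbone output would be 7 x 7 x 2048.
toy_map = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    [[5.0, 6.0, 7.0], [7.0, 6.0, 5.0]],
]
print(global_average_pool(toy_map))
```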

2) MOTION TREND FEATURES REPRESENTATION
To effectively describe motion trend features (X^M), the global correlation between labels is applied to the 3D dynamic feature map to obtain dynamic features with semantic association; the overall framework is depicted in Fig.3(b). First, we feed τ frames into a 3D convolution network to describe the overall motion trend. Then, category information is used as the input of the GCN to explore semantic association, which is finally combined with the motion trend. In our work, in line with [10], [39], we use stacked GCNs, where η_gcn is the GCN model and f_act is an activation function (we employ the sigmoid function in this research). The category association graph consists of X^C and A in Fig.3(b). X^C ∈ R^{c×q} is the feature matrix of the categories: each category is represented by a q-dimensional word vector, and c is the number of categories. A ∈ R^{c×c} is an asymmetric association matrix between categories. The symbol X^G is a matrix with category information, and z is the dimension of the dynamic feature map.
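A single graph-convolution layer of the kind stacked here can be sketched in plain Python. The sketch uses the standard propagation rule H' = f_act(A H W) with a sigmoid activation; it is an illustration under assumptions, not the trained model: the weight matrix W is a hypothetical learnable parameter, any normalisation of A is omitted, and the helper names are invented for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def gcn_layer(adjacency: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One graph-convolution layer: sigmoid(A @ H @ W).

    adjacency: c x c association matrix between categories
    features:  c x q_in category features (word vectors for the first layer)
    weight:    q_in x q_out layer weights (hypothetical, normally learned)
    """
    propagated = matmul(matmul(adjacency, features), weight)
    return [[sigmoid(v) for v in row] for row in propagated]


# Toy graph with c = 3 categories, 4-dimensional inputs and 2-dimensional outputs.
adjacency = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.0], [0.0, 0.3, 1.0]]
features = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
weight = [[0.1, -0.1], [0.2, 0.0], [0.0, 0.3], [-0.2, 0.1]]
hidden = gcn_layer(adjacency, features, weight)
print(hidden)
```

Stacking two such layers maps the c × q category features to a c × z matrix when the second layer has z output columns.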
In our experiments, the MF-Net [36] model pre-trained on Kinetics [40] is trained on the WWW [1] dataset. If each input video clip has a size of 16 × 224 × 224, we obtain an 8 × 7 × 7 × 768 feature map from the 'conv5' layer. Then, average pooling (AP) is employed to obtain the 3D dynamic feature X^D, where η_3d represents the 3D neural network model and z = 768.
Thus, the motion trend feature (X^M) is obtained. We assume that the ground-truth label of the videos is represented as G ∈ R^{n×c} (G_ij ∈ {0, 1}). The LSTM network is trained on the basis of X^L with the loss function in formula (17), where sig(•) is the sigmoid function [29] and Ĝ_ij represents the prediction value. During inference, general categories obtain probability information through the trained LSTM model. For small-sample categories, the output of the layer preceding the sigmoid function in the LSTM network is taken as the video feature (X), and the subspace classifier is then used for class prediction.
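A multi-label loss of this kind can be sketched as a binary cross-entropy applied to sigmoid outputs. The sketch below is an illustration under assumptions rather than the exact training loss: it averages over all n × c entries (the normalisation is assumed), takes the pre-sigmoid predictions as input, and uses a numerically stable rearrangement of the usual formula. The function name is hypothetical.

```python
import math
from typing import List


def bce_with_logits(logits: List[List[float]], targets: List[List[int]]) -> float:
    """Mean binary cross-entropy between sigmoid(logits) and 0/1 targets.

    Uses max(x, 0) - x * g + log(1 + exp(-|x|)), which equals
    -[g * log(sig(x)) + (1 - g) * log(1 - sig(x))] without overflow.
    """
    total = 0.0
    count = 0
    for logit_row, target_row in zip(logits, targets):
        for x, g in zip(logit_row, target_row):
            total += max(x, 0.0) - x * g + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# Two videos, three categories (toy values).
logits = [[2.0, -1.0, 0.0], [-3.0, 0.5, 4.0]]
targets = [[1, 0, 0], [0, 1, 1]]
print(bce_with_logits(logits, targets))
```

Each category contributes an independent term, which is what allows a video to carry several labels at once.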

IV. EXPERIMENT
In this section, we first describe the evaluation metrics and implementation details. Second, we report the experimental results on the multi-label crowd dataset WWW. Then, the subspace classification results are analyzed, and we conduct an ablation study to evaluate the key aspects of the proposed approach.

A. EVALUATION METRICS AND IMPLEMENTATION DETAILS
In all experiments, we use the area under the receiver operating characteristic (ROC) curve (AUC) and average precision (AP) as the evaluation metrics. AUC is a popular indicator for measuring classifier performance. AP can be effectively used for measuring the classification results of each category. To compare fairly with existing methods, we also report AP and AUC for each category.
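For reference, both metrics can be computed per category with a few lines of plain Python. This is a minimal sketch, assuming the standard definitions: AUC as the fraction of positive-negative pairs ranked correctly (ties count one half), and AP as the mean of the precision values at the rank of each positive sample. The function names are hypothetical; the mean over categories gives mAUC and mAP.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision@rank taken at the rank of every positive sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("AP needs at least one positive")
    return precision_sum / hits


def mean_over_categories(values: List[float]) -> float:
    return sum(values) / len(values)


scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels), average_precision(scores, labels))
```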
We use 3D dynamic information, the category feature representation, and the association matrix as input to train the GCN network. With the trained model, the product of the 3D dynamic feature and the output of the last GCN layer is used as the feature with semantic information. During training, following [10], we use two GCN layers with output dimensionalities of 384 and 768, respectively. As depicted in Fig.3(b), the input of the GCN includes the category feature matrix and the association matrix A. Each category feature is represented by a q-dimensional word vector (q = 300). We obtain a word vector model by training on the Wikipedia dataset [41] together with the categories of WWW. For network optimization, SGD is used as the optimizer. The momentum is set to 0.9, the weight decay is 10^-4, and the learning rate is 0.001.
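The product of the 3D dynamic feature and the last GCN layer output can be sketched as a matrix-vector product. The orientation is an assumption made for this illustration: the GCN output is taken as a c × z matrix (one z-dimensional row per category, z = 768 in the experiments) and the dynamic feature as a z-dimensional vector, so the product has one entry per category. The function name is hypothetical.

```python
from typing import List


def semantic_feature(gcn_output: List[List[float]], dynamic_feature: List[float]) -> List[float]:
    """Multiply a c x z category matrix by a z-dimensional dynamic feature."""
    z = len(dynamic_feature)
    for row in gcn_output:
        if len(row) != z:
            raise ValueError("each category row must have the same length as the feature")
    return [sum(w * x for w, x in zip(row, dynamic_feature)) for row in gcn_output]


# Toy sizes: c = 2 categories, z = 3 feature dimensions.
print(semantic_feature([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], [3.0, 4.0, 5.0]))
```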
Our LSTM network (in Fig.2) consists of two LSTM layers with output dimensionalities of 512 and 768, respectively. During training, the regular dropout is set to 0.4 in the LSTM network, which speeds up convergence in our experiments. Adam is used as the optimizer for network optimization, and binary cross entropy is adopted as the loss function.

B. EXPERIMENTAL RESULTS
In this part, we present a comparison of our proposed method with state-of-the-art methods on dataset WWW first. Then, quantitative evaluation results of dataset WWW are reported.

1) A COMPARISON OF OUR METHOD WITH STATE-OF-THE-ARTS
WWW [1] is a large-scale multi-label crowd dataset, which contains 10,000 video clips and 94 different categories. According to [1], the dataset is split into training, validation, and test sets at a ratio of 7 : 1 : 2. We use cross-validation on the training and validation sets at a ratio of 9 : 1. Finally, we evaluate the accuracy of the model on the test set.
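The 7 : 1 : 2 partition can be illustrated with a short Python sketch. The split used in the experiments follows [1]; the random shuffle, the seed, and the function name below are assumptions made only to show how the ratio translates into set sizes.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[int], ratios: Tuple[int, int, int] = (7, 1, 2), seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle the items and cut them into train / validation / test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(10000))
print(len(train), len(val), len(test))
```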
We compare our proposed method with state-of-the-art methods, including DLSF+DLMF [1], DLF+DLFO+AA [3], S-CNN [2], MIML [5] and CLDF [42] (shown in Table 2). In this paper, a model based on the LSTM network (DGSF-LSTM) is proposed, and the feature representation with category association is used as the input of the network.
1) DLSF+DLMF [1]. A deep model is used for learning the features of each category from the appearance and action information of each video, and the learned model is used for identifying unknown categories in the crowded video.
2) DLF+DLFO+AA [3]. The dependence of category information is used for obtaining a mapping relationship between categories and features to achieve a better scene classification effect, and a low-level feature extraction mechanism is also used for obtaining more descriptive feature information.
3) S-CNN [2]. A sliced convolutional neural network is proposed, which exploits 2D filters. Spatial filters obtain appearance information, and time slices capture dynamic clues. This method shows a strong ability to capture spatio-temporal features.
4) MIML [5]. A simple linear model is adopted. By using the relationship between multiple labels, the model learns a shared space for all labels from the original features and then trains a label-specific linear model from this space. This linear model reduces the number of parameters and speeds up training. In the current paper, we apply this linear model to multi-label crowd behavior recognition.
5) CLDF [42]. Class-level difficulty factors for multi-label classification are proposed in [42].
We reproduced the experiments following the idea of [42]. The ResNet50 model [28] pre-trained on ImageNet [29] was adopted as the backbone. Then, we retrained the network on the WWW dataset [1]. The resulting classification results are 95.7% and 69.8% in terms of mAUC and mAP, respectively. On this basis, the subspace idea was used to predict small-sample categories (CLDF [42] + subspace). Both the mAUC and mAP results (mAUC 0.961, mAP 0.699) outperform those of [42]. These experiments suggest that the subspace idea proposed in this paper is effective on imbalanced datasets. We also compared the CLDF [42] + subspace method with ours (DGSF-LSTM + subspace) on the 94 categories of the WWW dataset. Experiments show that features obtained by LSTM perform better on categories of action information, such as stand, sit, walk, run, swim, dance, photography, dining, and shopping. However, CLDF [42] has more advantages on categories of scene information, such as indoor, outdoor, airport, street, stadium, concern, square, beach, and school.

2) QUANTITATIVE EVALUATION
Quantitative evaluation results are shown in Table 2; they indicate that our model is effective for crowded scene classification. In particular, the performance of our subspace model (DGSF-LSTM + subspace) in terms of mean AUC reaches the state-of-the-art level.
To evaluate our feature representation, further comparison experiments are conducted, as shown in Table 2. With VGG-16 [27] as the pre-trained model, the proposed DGSF (the fusion of static features and dynamic features with category information) is, notably in mAP, 7% higher than SF (static features) and 6% higher than DSF (the fusion of static features and dynamic features), indicating that category association facilitates the recognition of video actions. Meanwhile, the experiments with ResNet50 and VGG-16 show that the ResNet50 network is superior to VGG-16.
We are also interested in the performance on each category. Fig.4 shows the AUC and AP values for all categories obtained with the DGSF-LSTM method. Some categories (such as "indoor", "Outdoor", "Street" and "Performance") shown in the line chart obtain better classification results through the fusion of static features and dynamic features with category information (DGSF), which suggests that categories with a relatively large number of samples converge easily during training. However, for categories with only a few samples, such as "police", "queue" and "disaster", good classification results cannot be obtained using the LSTM network and the sigmoid activation function.

C. SUBSPACE CLASSIFICATION
In this part, the subspace classification results are reported. To analyze and understand crowd behaviors, existing studies [1]-[3] conduct experiments only on the WWW dataset, so we apply the idea of subspace to the WWW dataset in this paper. The research goal of this paper is crowd behavior recognition; however, the existing crowd datasets WorldExpo'10 [43], UCF-CC-50 [44], UCF-QNRF [45] and Shanghaitech [46] are intended for crowd counting. Other crowd datasets, such as S-Hock [47] and Violent-Flows [48], are not multi-label. Since we consider the WWW dataset alone insufficient, the image-based multi-label dataset VOC 2007 [37] (an imbalanced image dataset) is also selected to illustrate the effectiveness of the subspace idea. In the following sections, we present subspace classification results on the WWW and VOC 2007 datasets.

1) SUBSPACE CLASSIFIER ON THE DATASET WWW
Firstly, the categories with few samples in the training set, shown in Table 3, are determined according to formula (1). Then, according to formula (2), a subspace is constructed for each small sample category. For example, the subspace of category 'knell' has 76 categories while the original space has 94. Finally, according to formulas (5)-(11), the subspace classifier is designed to optimize the corresponding subspace of category 'knell'. As shown in Table 3, the performance of category 'knell' is significantly improved.
In addition, we also apply the idea of subspace to the classifier MIML [5] (MIML + subspace) and report the experiment results. MIML is a multi-label classifier with category association, so the subspace idea can be integrated into it directly. Table 3 indicates that the performance of most categories improves significantly when the idea of subspace is adopted. However, the result of DGSF-LSTM + subspace is superior to that of MIML + subspace, which shows that MIML [5] has certain limitations on categories with fewer samples. With DGSF-LSTM + subspace, the mAUC of these categories is increased by 3.4% overall, while their mAP is increased by 1.3%. The performance of category 'attend classes', however, declines after the subspace is applied. Since this category already achieves decent performance with the DGSF-LSTM model (mAUC is 99.8%), other classifiers leave little room for further improvement. CLDF [42] uses a variety of techniques to extract video features suitable for multi-label classification. For the prediction of small sample categories, we use the subspace classifier on top of the CLDF features. Experimental results show that the subspace classifier can also be applied to CLDF features.

2) SUBSPACE CLASSIFICATION RESULTS ON THE DATASET VOC 2007
PASCAL Visual Object Classes Challenge (VOC 2007) [37] is another popular dataset for multi-label recognition. It contains 9,963 images from 20 object categories, which are divided into train, val and test sets. For fair comparison, we use the trainval set to train our model and evaluate the recognition performance on the test set.
To evaluate the subspace idea, we conduct the same experiments on the dataset VOC 2007. Firstly, we select the categories with fewer samples according to formula (1). These categories are shown in Table 4. Then, according to formula (2), we construct a subspace for each category. Finally, we retrain the model SSGRL [6] using the generated subspace of each category and obtain the classification results. Table 4 shows that most of the categories adopting the subspace perform better than with the original model. Overall, the mAP of these categories is increased by 0.5%.

D. ABLATION STUDIES
In this section, we perform ablation studies on three aspects: the number of video frames, the values of the parameters γ and λ, and the replacement of the LSTM network structure with a Gated Recurrent Unit (GRU).

1) VIDEO STATIC FRAMES
In order to compare the test results fairly, the number of static frames per test video is set to 75, the same as in [1], [3]. In this paper, we also report the test results for other numbers of frames. The current results are based on the DGSF-LSTM model. Table 5 shows the mAUC and mAP for 5, 15, 25, 50 and 75 frames. Notably, the mAUC with 5 frames is only 0.12% lower than the 95.4% obtained with 75 frames. This shows that relatively good classification results can still be obtained by reducing the number of frames while appropriately increasing the span between the sampled frames.
Moreover, using fewer video frames brings a considerable efficiency advantage. On a 2080 Ti GPU, category prediction for a short video takes 1224 ms with the 75-frame strategy, but only 94 ms with the 5-frame strategy.
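The idea of using fewer frames with a larger span between them can be illustrated with a uniform sampling scheme. The sketch below is an assumption about how such sampling might be implemented, not the procedure used in the paper: the function name `sample_frame_indices` and the 300-frame video length are hypothetical, and the video is simply split into equal segments with the centre frame of each segment selected.

```python
import numpy as np

def sample_frame_indices(num_video_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the whole video."""
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("both arguments must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_video_frames)
    # Split the video into equal segments and take each segment's centre.
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return centres.astype(int)

# Hypothetical 300-frame video: 75 samples have a span of 4 frames,
# while 5 samples have a span of 60 frames.
dense = sample_frame_indices(300, 75)
sparse = sample_frame_indices(300, 5)
```

With the timings reported above, the 5-frame strategy is roughly 1224 / 94 ≈ 13 times faster per video than the 75-frame strategy.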

2) EFFECTS OF DIFFERENT PARAMETER VALUES
The results shown in Fig. 5 are the average AUC of the small sample categories (obtained according to formula (1)). We varied the values of the parameters γ and λ simultaneously, tested them on the training set, and plotted the final results as a 3D mesh graph (shown in Fig. 5(a)). In order to explore the trend with respect to each of the two parameters, we plotted cross-sections through the middle of the 3D grid in the two directions (shown in Fig. 5(b) and Fig. 5(c)). In addition, we empirically set the values of β1 and β2 (in formula (2)) to 0.05 and 0.01, respectively.
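A joint sweep over γ and λ with cross-sections through the middle of the grid could be organised as in the following sketch. Everything here is illustrative: `toy_score` is a made-up placeholder standing in for training the subspace classifier and measuring the average AUC of the small sample categories, and the candidate parameter values are assumptions rather than the values used in the paper.

```python
import numpy as np

def sweep(evaluate, gammas, lambdas):
    """Evaluate every (gamma, lambda) pair; rows follow gammas, columns lambdas."""
    return np.array([[evaluate(g, l) for l in lambdas] for g in gammas])

def middle_cross_sections(grid):
    """Return the slices through the middle row and the middle column."""
    mid_row, mid_col = grid.shape[0] // 2, grid.shape[1] // 2
    along_lambda = grid[mid_row, :]  # gamma fixed at its middle value
    along_gamma = grid[:, mid_col]   # lambda fixed at its middle value
    return along_lambda, along_gamma

def toy_score(gamma, lam):
    """Placeholder metric (NOT real results): peaks at gamma=0.5, lambda=0.3."""
    return -((gamma - 0.5) ** 2) - (lam - 0.3) ** 2

gammas = [0.1, 0.5, 0.9]   # hypothetical candidate values
lambdas = [0.1, 0.3, 0.5]  # hypothetical candidate values
grid = sweep(toy_score, gammas, lambdas)
along_lambda, along_gamma = middle_cross_sections(grid)
best_g, best_l = np.unravel_index(np.argmax(grid), grid.shape)
```

The full `grid` corresponds to the kind of surface drawn as a mesh, while the two returned slices correspond to the kind of curves shown in the cross-section panels.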

3) GRU STRUCTURE
In our work, we also conduct some experiments under the GRU structure. GRU is a variant of LSTM that combines the forget gate and the input gate into a single update gate. GRU has fewer parameters than LSTM, so the final model is simpler than the LSTM model and less prone to overfitting. Experiments show that, when the number of frames is 75, the results under the GRU structure (mAP is 67.3% and mAUC is 95.6%) are higher than those under the LSTM structure. This may be because convergence is easier with GRU when the number of samples is insufficient.
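The difference in parameter count can be made concrete with a small calculation. The sketch below uses the textbook formulation with one bias vector per block: an LSTM layer has four weight blocks (input, forget and output gates plus the candidate cell state), whereas a GRU layer has three (update and reset gates plus the candidate hidden state). The function names and the layer sizes are hypothetical and are not taken from the paper; some libraries keep two bias vectors per block, which changes the absolute counts but not the 3:4 ratio.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters of one recurrent layer with num_blocks gate/candidate blocks."""
    # Each block has input weights, recurrent weights and one bias vector.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

def lstm_params(input_size, hidden_size):
    """Input gate, forget gate, output gate and candidate cell state."""
    return recurrent_param_count(input_size, hidden_size, 4)

def gru_params(input_size, hidden_size):
    """Update gate, reset gate and candidate hidden state."""
    return recurrent_param_count(input_size, hidden_size, 3)

# Hypothetical sizes: 4096-dimensional frame features, 512 hidden units.
lstm_total = lstm_params(4096, 512)
gru_total = gru_params(4096, 512)
ratio = gru_total / lstm_total
```

For any layer size under this formulation, the GRU layer has three quarters of the parameters of the corresponding LSTM layer.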

V. CONCLUSION
Multi-label action recognition entails the prediction of labels co-occurring in videos. Due to the imbalanced distribution of samples, some categories with fewer samples do not converge easily. To solve this problem, we introduced the idea of subspace. The subspace method not only harnesses the association among categories, but also simplifies the distribution of categories. Meanwhile, a classifier based on the subspace is designed for better classification results. In addition, to obtain a discriminative feature representation in crowded scenes, we injected the dependence relationship among categories into the dynamic information, strengthening the latter with a stronger semantic relationship. Then, the dynamic features with strong semantic correlation are fused into the frame-level static features.
In conclusion, the association information of categories is utilized to compensate for the imbalance of samples. Our method can be regarded as a fundamental technique with potential for other related applications. For example, a new category may be added during the recognition process while the number of samples for this new category is insufficient. In this case, we can construct a subspace for the new category based on the idea of subspace, and address the problem of imbalanced samples under the guidance of the association relationship among categories.