Low-Rank Regularized Multimodal Representation for Micro-Video Event Detection



I. INTRODUCTION
Currently, with the rapid development of communication technology and portable devices, micro-videos have become a new trend of user-generated content on social media platforms. Compared with traditional text and images, micro-videos offer a livelier style for recording and exhibiting people's lives. Micro-videos were originally designed to exploit users' fragmented time and thus have a very limited duration. For example, TikTok and Instagram, two well-known micro-video platforms, prescribe a limit of at most 15 seconds for ordinary users.
The study of micro-videos has become a promising research field attracting increasing attention, and several pioneering efforts have been made to ease the problem of micro-video intelligence analysis. For example, Redi et al. [1] attempted to understand the computational features that correlate with the creativity of micro-videos. Cao et al. [2] proposed an attention-based neural collaborative filtering model to address the problem of micro-video group recommendation. Chen et al. [3] introduced a transductive multimodal learning model to predict micro-video popularity, in which the modality-relatedness and insufficiency limitations are alleviated by finding a common shared subspace. To better estimate the venue category of micro-videos, Nie et al. [4] introduced a deep transfer model that enhances the semantic representations of micro-videos by exploiting external sound knowledge. Liu et al. [5] developed a tree-guided multimodal dictionary learning algorithm to solve the organization problem of online micro-videos. Wei et al. [6] presented a relation-aware neural multimodal cooperative learning model to characterize the complex correlations among different modalities. (The associate editor coordinating the review of this manuscript and approving it for publication was Min Xia.)
In addition, the analysis of the event semantics of micro-videos plays a significant role in many potential applications, such as automatic annotation, retrieval, and supervision. Nevertheless, very few studies have attempted to solve the event detection problem for micro-videos, and several limitations could explain this shortage. Firstly, there is already some relevant work on event recognition in traditional videos and still images [7]-[14], which reveals that high-level semantic cues, such as the semantics of objects and scenes, human poses, and garments, can provide informative priors for event detection. For example, Wang et al. [15] presented an approach for event-driven web video summarization by tag localization and key-shot mining. Hong et al. [16] proposed a novel scheme that summarizes the content of video search results by mining and threading ''key'' shots, such that users can get an overview of the main content of these videos at a glance. Zhao et al. [17] proposed to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Zhao et al. [18] introduced a real-time event detection method that generates an intermediate semantic level from social multimedia data, named the microblog clique (MC), which is able to explore the high correlations among different microblogs. Micro-videos, however, not only involve a similar integration and interaction of multiple high-level semantics but also exhibit more serious variations in visual appearance and spatial structure, even within the same event category. Specifically, a video event usually encompasses continuous behavior with continuous variations that not only evolve in the temporal domain but also involve interactions among multiple objects in the spatial domain. Compared with the static concepts of objects and scenes, representation learning for video events therefore tends to be more complicated.
Secondly, due to the short time length and certain practical factors, such as post-production and editing, irrelevant or noisy information is usually embedded in the source micro-videos, obscuring the valuable cues in the original representations. Furthermore, another significant obstacle to the study of micro-video event detection is the lack of publicly available datasets.
To address these limitations, in this paper, we propose a low-rank regularized multimodal representation learning method for micro-video event detection. We first characterize micro-videos by multiple modalities and eliminate the heterogeneity and redundant information among modalities by maximizing the correlation among them. Inspired by recent low-rank representation [19]-[24] and multimodal [25], [26] methods, a considerable gain in detection accuracy is then achieved by further exploring low-rank representation and label-relaxation strategies [27]. Specifically, the lowest-rank representation shared by all modalities, which captures their intrinsic information, is learned by enforcing a low-rank constraint. Meanwhile, a label-relaxation strategy that provides more freedom in the mapping from the feature space to the label space is developed to relax the strict binary label assignment into a soft constraint. Moreover, we constructed a new micro-video event detection dataset to verify the effectiveness of our proposed method.
The rest of this paper is organized as follows. Our proposed algorithm is described in Section 2. The performance evaluation of our proposed algorithm is discussed in Section 3, followed by the conclusion in Section 4.

II. PROPOSED METHOD
Assume that we are given a set of N micro-videos {x_1, x_2, · · · , x_N}, where x_i^k ∈ R^{D_k} is the feature vector of x_i extracted from the k-th view, K is the number of views, and D_k is the feature dimension of the k-th view. We denote by X_k ∈ R^{D_k × N} the feature matrix of the k-th view.
Micro-video event detection involves a union of multiple types of features. Although the multiple representations of micro-videos are complementary and mutually beneficial, the heterogeneity among modalities tends to place them in distinct subspaces. To alleviate this problem, we first map the original observations to low-dimensional subspaces through a set of view-specific embedding models {W_1, · · · , W_K}. Considering low-rank techniques for denoising [28], we then decompose the mapped observations into latent low-rank semantic dictionaries {U_1, · · · , U_K} with their corresponding reconstruction errors {E_1, · · · , E_K} to derive a latent common representation Z. Accordingly, we formulate the backbone of our proposed method as

min_{W_k, U_k, Z, E_k}  ||Z||_* + Σ_{k=1}^{K} ( γ_1 ||E_k||_{2,1} + γ_2 ||E_k||_1 ),  s.t.  W_k^T X_k = U_k Z + E_k,  k = 1, · · · , K,  (1)

where Z encodes the coefficients of the mapped features over the semantic dictionaries; || · ||_* denotes the nuclear norm, which enforces low-rank solutions; || · ||_1 and || · ||_{2,1} are the ℓ_1- and ℓ_{2,1}-norms, which are imposed on the reconstruction errors to simultaneously handle random corruption and maintain robustness against outliers; and γ_1 and γ_2 are balance parameters.
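As a rough numerical illustration of the backbone objective above, the following numpy sketch evaluates the nuclear-norm, ℓ_{2,1}, and ℓ_1 terms for given variables. The variable names (Xs, Ws, Us, Z) and shapes are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def nuclear_norm(M):
    # Sum of singular values; penalizing it promotes low-rank solutions.
    return np.linalg.svd(M, compute_uv=False).sum()

def l21_norm(M):
    # Sum of column-wise l2 norms; robust to sample-specific outliers.
    return np.linalg.norm(M, axis=0).sum()

def backbone_objective(Ws, Us, Z, Xs, gamma1, gamma2):
    """Evaluate ||Z||_* + sum_k (gamma1*||E_k||_{2,1} + gamma2*||E_k||_1),
    with per-view reconstruction error E_k = W_k^T X_k - U_k Z."""
    value = nuclear_norm(Z)
    for W, U, X in zip(Ws, Us, Xs):
        E = W.T @ X - U @ Z
        value += gamma1 * l21_norm(E) + gamma2 * np.abs(E).sum()
    return value
```

Such a routine is useful only for monitoring the objective during optimization; the actual minimization proceeds variable-by-variable as described in Section II-3.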
To enhance the representation ability of Eq. (1), we further consider the following:

1) CORRELATION ANALYSIS
To obtain the common representation of all modalities, we maximize the total correlation across all pairs of modalities while discarding part of the redundant information:

max_{W_i}  Σ_{i≠j} tr(W_i^T S_ij W_j),  s.t.  W_i^T W_i = I,  (2)

where I is an identity matrix and S_ij = X_i X_j^T is defined as the covariance matrix of the i-th and j-th modalities.
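The pairwise correlation term can be sketched directly from its definition. The snippet below, a minimal illustration with assumed variable names, accumulates tr(W_i^T S_ij W_j) over all ordered pairs i ≠ j with S_ij = X_i X_j^T.

```python
import numpy as np

def total_correlation(Ws, Xs):
    """Total correlation sum_{i != j} tr(W_i^T S_ij W_j),
    where S_ij = X_i @ X_j.T is the cross-covariance of views i and j."""
    K = len(Xs)
    total = 0.0
    for i in range(K):
        for j in range(K):
            if i == j:
                continue  # only cross-modality pairs contribute
            S_ij = Xs[i] @ Xs[j].T
            total += np.trace(Ws[i].T @ S_ij @ Ws[j])
    return total
```

Maximizing this quantity under the orthogonality constraints W_i^T W_i = I aligns the view-specific embeddings, which is the CCA-style intuition behind Eq. (2).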

2) LABEL RELAXATION
We denote the binary label matrix of the micro-videos as Y = [y_1; y_2; · · · ; y_N] ∈ {0, 1}^{N×C}, where y_i ∈ R^C is the label vector of the i-th micro-video and C is the number of classes. If the i-th micro-video belongs to the j-th event type, Y_ij = 1, and Y_ij = 0 otherwise. For reasons of efficiency, we first cast micro-video event detection in the regression form min_A ||Z^T A − Y||_F^2 + γ ||A||_F^2, where A is the coefficient matrix building the connection between the latent feature representations and the label matrix, and γ is a balance parameter. We then consider two improvements: one is to relax the strict binary label assignment into soft label assignments, and the other is to preserve the geometric structure of each modality to guide common representation learning.
For the first improvement, the widely used binary label assignment is too rigid to provide freedom in the mapping from the feature space to the label space. To address this problem, we introduce a nonnegative label-relaxation matrix M that combines with an indicator matrix B to form a slack label matrix Y + B ⊙ M, where ⊙ is the Hadamard product operator and B is set to B_ij = 1 when Y_ij = 1, and B_ij = −1 otherwise. Therefore, the label-relaxed regression model can be reformulated as

min_{A, M}  ||Z^T A − (Y + B ⊙ M)||_F^2 + γ ||A||_F^2,  s.t.  M ≥ 0.  (3)

For the second improvement, a straightforward method is to construct a common shared graph Laplacian to preserve the geometric structure consistency among modalities, L_k = D_k − S_k, where L_k is the graph Laplacian matrix for the k-th view, S_k is the weight matrix computed by the Gaussian similarity function, and D_k is the diagonal degree matrix with D_k,ii = Σ_j S_k,ij. The main idea of the graph regularizer is that if two samples are similar in the original feature space, they should also be close neighbors in the latent low-rank space, and vice versa. Different from general methods that calculate the Laplacian matrix only in the feature space, our algorithm calculates a unified graph structure to build connections between features and labels, which ensures the reliability of the prediction results and the robustness of the algorithm. By combining the low-rank multimodal embedding and the label-relaxed discriminant learning terms, we formulate the compact optimization problem as

min  ||Z||_* + Σ_k ( γ_1 ||E_k||_{2,1} + γ_2 ||E_k||_1 ) + α tr(Z L Z^T) − β Σ_{i≠j} tr(W_i^T S_ij W_j) + ϕ ( ||Z^T A − (Y + B ⊙ M)||_F^2 + γ ||A||_F^2 ),
s.t.  W_k^T X_k = U_k Z + E_k,  W_i^T W_i = I,  M ≥ 0,  (4)

where L denotes the unified graph Laplacian. The meaning of each symbol in the objective function is summarized in TABLE 1.
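The slack label construction Y + B ⊙ M is simple enough to state in a few lines. The sketch below (our own illustration, with hypothetical variable names) builds B from the binary labels and applies the Hadamard product; note how positive entries are pushed above 1 and negative entries below 0, enlarging the regression margins.

```python
import numpy as np

def slack_labels(Y, M):
    """Slack label matrix Y + B (Hadamard) M, where B_ij = +1 if Y_ij = 1
    and B_ij = -1 otherwise; M >= 0 is the label-relaxation matrix."""
    Y = np.asarray(Y, dtype=float)
    B = np.where(Y == 1, 1.0, -1.0)
    return Y + B * np.maximum(M, 0.0)  # clamp M to keep it nonnegative
```

For example, with uniform relaxation M = 0.2, a one-hot row [1, 0] becomes [1.2, -0.2], so the regression target for the true class moves away from the targets of the wrong classes rather than sitting exactly at the binary values.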

3) OPTIMIZATION
Due to the nonconvexity of Eq. (4), it is difficult to achieve the globally optimal solution. We therefore develop an effective optimization algorithm that iteratively updates each variable while fixing the others. We first introduce two relaxed variables E_1 and E_2 to obtain the augmented Lagrangian function [29], where Y_1, Y_2, Y_3, and Y_4 are Lagrange multipliers and µ > 0 is a penalty parameter. Update W: By setting the derivative of Eq. (5) with respect to W to zero, we obtain the closed-form solution of W. Update U: Letting Û denote the optimal solution from the previous iteration, the subproblem in Eq. (7) is a standard nuclear-norm minimization problem, which can be approximately solved by the singular value thresholding (SVT) algorithm [30].
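The SVT step used for the U-subproblem has a well-known closed form: take the SVD, soft-threshold the singular values, and reassemble. A minimal sketch (our illustration, not the paper's code):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau*||.||_*.
    Shrinks every singular value of M by tau and truncates at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt  # broadcasting scales U's columns by s_shrunk
```

Because singular values smaller than tau are zeroed out, the result has rank no larger than the input, which is exactly why SVT drives the dictionary toward a low-rank solution.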
Update Z: We update Z by dropping the terms independent of Z. By setting the derivative of Eq. (8) with respect to Z to zero, we obtain the closed-form solution of Eq. (8), in which Ĝ = W^T X − E + Y_1/µ. The resulting equation can be solved as a Lyapunov equation [31].
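Closed-form updates of this Sylvester/Lyapunov type (P Z + Z Q = R) can be solved with an off-the-shelf routine. The sketch below uses scipy's Bartels-Stewart solver on hypothetical coefficient matrices P, Q, R standing in for the (unspecified) coefficients of the Z-subproblem.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_z_subproblem(P, Q, R):
    """Solve the Sylvester/Lyapunov equation P Z + Z Q = R for Z,
    as arises when the derivative of the Z-subproblem is set to zero."""
    return solve_sylvester(P, Q, R)
```

A unique solution exists whenever P and −Q share no eigenvalues; in ADMM-style updates this is typically ensured because the penalty µ makes the left-hand coefficients positive definite.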
Update E: We update E by the closed-form solution of Eq. (10). Update E_1: We update E_1 using the operator D_{γ_1/µ}[·]. Given an arbitrary matrix Q = [q_1, q_2, · · · , q_i, · · · ] and a parameter θ, the operator D_θ[Q] scales each column as D_θ[Q]_(:,i) = (max(||q_i||_2 − θ, 0) / ||q_i||_2) q_i, which is the standard solution of the ℓ_{2,1}-norm minimization problem. Update E_2: We update E_2 using S_{γ_2/µ}[·], where S denotes the element-wise shrinkage operator [32]. Update A: We update A by the closed-form solution of Eq. (14). Update M: The optimal solution of Eq. (16) can be obtained as M = max(B ⊙ Q, 0), where Q = Z^T A − Y.
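The two proximal operators used in the E_1 and E_2 updates are short enough to sketch directly. In this illustration (our own naming), soft_threshold plays the role of the element-wise shrinkage S_θ[·] and column_shrink the role of the column-wise operator D_θ[·].

```python
import numpy as np

def soft_threshold(Q, theta):
    # Element-wise shrinkage S_theta[.]: proximal operator of theta*||.||_1.
    Q = np.asarray(Q, dtype=float)
    return np.sign(Q) * np.maximum(np.abs(Q) - theta, 0.0)

def column_shrink(Q, theta):
    # Column-wise operator D_theta[.]: proximal operator of theta*||.||_{2,1};
    # each column q_i is scaled by max(||q_i||_2 - theta, 0) / ||q_i||_2.
    Q = np.asarray(Q, dtype=float)
    out = np.zeros_like(Q)
    norms = np.linalg.norm(Q, axis=0)
    keep = norms > theta  # columns with small norm are zeroed entirely
    out[:, keep] = Q[:, keep] * (norms[keep] - theta) / norms[keep]
    return out
```

The element-wise operator suppresses scattered random corruption, while the column-wise operator zeroes entire columns, i.e., it discards whole outlier samples, matching the roles of the ℓ_1- and ℓ_{2,1}-norms in Eq. (1).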

III. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATASET AND EXPERIMENT SETTING
We constructed a micro-video event detection dataset from Flickr through its public API. We collected 20 event types with reference to TRECVID MED 2013 [33], as illustrated in FIGURE 2. Specifically, we first crawled no fewer than 5,000 micro-videos for each event type. To prevent mismatches between the original labels and the ground truths, three human annotators were asked to vote to clean up the micro-videos, and micro-videos with low resolution were also removed. Eventually, 12,318 micro-videos were preserved in total, and approximately 75% of them are shorter than 10 seconds. The statistics of our constructed Flickr micro-video event detection dataset are shown in FIGURE 3. We randomly selected 9,270 micro-videos as the training set and used the remaining micro-videos as the test set.

(Algorithm 1: while not converged, fix the other variables and update in turn W_{t+1}, U_{t+1}, Z_{t+1}, E_{t+1} = (1/3)(G_{5,t} + G_{6,t} + G_{7,t}), E_{1,t+1}, E_{2,t+1}, A_{t+1}, M_{t+1} = max(R_t B, 0), and finally the multipliers Y_{1,t+1}, Y_{2,t+1}, Y_{3,t+1}, Y_{4,t+1} and the penalty µ_{t+1}.)
All experiments were repeated three times with random training/test partitions, and the average performance is reported. Moreover, the trade-off parameters were all tuned by a grid-search strategy. We adopted the mean average precision (mAP), recall, and precision as metrics to measure the accuracy of the results.
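For reference, the mAP metric used throughout the experiments can be computed per event class and averaged. The following sketch (our own implementation of the standard definition, not the paper's evaluation code) ranks samples by score and averages the precision at the rank of each positive.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one event class: mean precision at the rank of each positive."""
    order = np.argsort(-np.asarray(scores))      # descending by score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                     # positives retrieved so far
    ranks = np.arange(1, len(labels) + 1)
    precisions = hits / ranks
    return precisions[labels == 1].mean()

def mean_average_precision(score_matrix, label_matrix):
    # mAP: average of the per-class APs over the C event classes (columns).
    return float(np.mean([average_precision(score_matrix[:, c], label_matrix[:, c])
                          for c in range(label_matrix.shape[1])]))
```

A perfect ranking (all positives above all negatives) yields AP = 1.0 per class, so mAP = 1.0 is the ceiling the detection curves approach.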

B. FEATURE EXTRACTION
Because of the strong performance of deep convolutional neural networks (CNNs) [34]-[38] in various visual recognition tasks, we extracted deep object and scene representations to characterize micro-videos. In more detail, we employed two well-trained VGGNets [39] as our feature extractors, trained respectively on 1.2 million images of 1,000 object classes from the ImageNet dataset and 1.8 million images of 365 scene categories from the Places365-standard dataset [40]. For each feature extractor, the outputs of the fully connected layer fc6 were treated as high-level semantic representations [41], resulting in 2048D feature vectors. Moreover, to obtain more robust representations, we adopted a keyframe-based strategy [42], i.e., the final features were obtained by average pooling the features extracted from keyframes.

1) EVALUATION ON CONVERGENCE
We investigated the convergence of our proposed alternating algorithm. In addition to the objective function values, we also measured the convergence of Z via the variation between two consecutive iterations, since Z is directly used to perform event detection. FIGURE 4 shows the convergence curves of our proposed optimization algorithm. From the figure, we can see that both curves decreased rapidly as the number of iterations increased. In particular, the curves stabilized after approximately 40 iterations, which guarantees the stability of our proposed optimization algorithm.
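One plausible way to realize the Z-convergence measure described above is the relative Frobenius-norm change between consecutive iterates; the exact criterion is not specified in the text, so the sketch below is an assumption.

```python
import numpy as np

def relative_change(Z_new, Z_old, eps=1e-12):
    """Relative Frobenius-norm change ||Z_new - Z_old||_F / ||Z_old||_F
    between consecutive iterates; eps guards against division by zero."""
    return np.linalg.norm(Z_new - Z_old) / (np.linalg.norm(Z_old) + eps)
```

Iteration would typically stop once this quantity falls below a small tolerance (e.g., 1e-4) or a maximum iteration count is reached.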

2) EVALUATION ON PARAMETERS
Among all parameters, α, β, and ϕ play significant roles in our proposed method since they control the impact of geometric structure consistency, correlation maximization, and label relaxation, respectively. FIGURE 5(a), FIGURE 5(b), and FIGURE 5(c) show the detection performance of our method under different parameter values. In our experiments, α, β, and ϕ were selected by a grid-search strategy in a heuristic manner, ranging from 0.00001 to 1, 0.01 to 100, and 0.0001 to 10, respectively. From the figures, we observe that the best performance was obtained with α = 0.001, β = 0.7, and ϕ = 0.01; when these three parameters were too large, performance dropped noticeably. We also varied the percentage of labeled samples from 30% to 100%. As depicted in FIGURE 5(d), performance improved significantly with sufficient labeled information.

3) EVALUATION ON TRAINING SAMPLES
This section discusses the influence of the training-set size on the micro-video event detection results of our low-rank multimodal representation learning. Specifically, new training sets were constructed by randomly sampling different proportions of the original training set, while the original test set was kept unchanged. The experiments were carried out both without a kernel function and with a (linear) kernel function, and the detection results are shown in FIGURE 6(a) and FIGURE 6(b), respectively. FIGURE 6(a) shows the detection results without the kernel function. When the training set grows to 70%, the mAP and precision curves begin to flatten. On the one hand, this shows that increasing the training-set size effectively improves detection accuracy when the training set is small; on the other hand, it also shows that the improvement is limited: once the training set reaches a certain size, the accuracy barely improves, i.e., there is a threshold on the accuracy gain obtainable by enlarging the training set. FIGURE 6(b) shows the detection results with the kernel function. As the training-set size increases, the mAP and precision curves keep rising. Compared with the results without the kernel function, the overall detection accuracy is higher and has greater potential for growth after kernel processing.

4) EVALUATION ON ABLATION ANALYSIS
To analyze the necessity of each component in our objective function, we measured the change in the averaged experimental results after removing different modules. Specifically, the objective function was modified as follows: noM: we removed the label-relaxation matrix M by setting M = 0. noGraph: we removed the graph regularization term, namely the Laplacian matrix, by setting α = 0. noMCCA: we eliminated the influence of correlation analysis by setting β = 0. The final recognition results for each evaluation metric after removing the different components are shown in TABLE 2.
The following conclusions can be drawn from TABLE 2: 1) the most influential component is the correlation analysis removed in noMCCA, which encodes the complementarity and correlation among modalities in our proposed algorithm; 2) the prediction results of both noM and noGraph are worse than those of the complete algorithm, which shows the necessity of introducing the graph regularizer and the label-relaxation matrix; and 3) the result of noM is worse than that of noGraph, illustrating that the label-relaxation term has a more significant influence than the graph regularization term.

5) COMPARISON WITH STATE-OF-THE-ART METHODS
We compare our proposed method with several state-of-the-art methods, including the support vector machine (SVM) [43], supervised regularization-based robust subspace (SRRS) [44], low-rank common subspace (LRCS) [45], discriminative transfer subspace learning (DTSL) [46], robust multiview subspace learning (RMSL) [47], collective low-rank subspace (CLRS) [48], nearly-isotonic support vector machine (NI-SVM) [49], bilevel semantic representation (B-LSA) [8], deep parallel sequence with sparse constraint (EASTERN) [50], deep transfer model (DARE) [4], 3D CNN (C3D) [51], and inflated 3D ConvNet (I3D) [52]. TABLE 3 shows the recognition performance compared with these state-of-the-art algorithms. Our method outperformed all the other algorithms in terms of mAP, precision, and recall. Benefiting from their multiview property, RMSL, CLRS, and our proposed algorithm performed much better than SRRS, LRCS, and DTSL. Our algorithm also outperformed RMSL and CLRS; one possible reason is that the geometric structure in each modality is preserved in our model to guide better representation learning. Among the deep learning methods, we chose two convolutional neural networks, C3D and I3D, for event detection and action recognition; their results are rather ordinary. We also chose four methods that exploit deep semantic concepts to detect micro-video events, NI-SVM, B-LSA, EASTERN, and DARE. Their results are better than those of C3D and I3D, which shows that latent semantic features in these videos can represent complex events better.

IV. CONCLUSION
In this paper, we proposed a label-relaxed multimodal low-rank representation algorithm for micro-video event detection. To learn more intrinsic representations of micro-videos, the complementarity among modalities was explored to compensate for the limited descriptive power of each individual modality, and low-rank and label-relaxed constraints were further exploited to ensure more intrinsic representations. Experimental results on our constructed dataset verified the superior performance of the proposed method. Considering the complex patterns among different modalities, especially in scenarios with missing modalities, in the future we will develop a novel and flexible multimodal representation framework to deal with the problem of missing modalities.