Cholesky Decomposition-Based Metric Learning for Video-Based Human Action Recognition

Video-based human action recognition aims to understand human actions and behaviours in video sequences, and has wide applications in health care, human-machine interaction and so on. Metric learning, which learns a similarity metric, plays an important role in human action recognition. However, learning a full-rank matrix is usually inefficient and easily leads to overfitting. A common way to overcome these issues is to impose a low-rank constraint on the learned matrix. This paper proposes a novel Cholesky decomposition based metric learning (CDML) method for effective video-based human action recognition. Firstly, the improved dense trajectories technique and the vector of locally aggregated descriptors (VLAD) are used for feature detection and feature encoding, respectively. Then, considering the high dimensionality of VLAD features, we propose to learn a similarity matrix by taking advantage of Cholesky decomposition, which factorizes the matrix into the product of a lower triangular matrix and its transpose. Different from traditional low-rank metric learning methods that explicitly impose a low-rank constraint on the learned matrix, the proposed algorithm achieves this constraint by controlling the rank of the lower triangular matrix, thus leading to high computational efficiency. Experimental results on a public video dataset show that the proposed method achieves superior performance compared with several state-of-the-art methods.


I. INTRODUCTION
Video-based human action recognition aims to recognize and understand the actions and behaviours in video sequences. It has attracted increasing attention in computer vision, mainly because of its wide variety of applications, such as health care [1] and human-machine interaction [2]. For example, in the health care domain, human actions can be automatically recognized in real time, so that medical assistance can be provided promptly. As reported in [29], some actions, such as running, jumping, climbing and falling, are potentially dangerous for seniors and children, and video-based action recognition can greatly help to reduce the occurrence of accidents.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenbing Zhao .
After intensive research in recent years, video-based human action recognition technology has made great progress. However, due to the dynamic appearance changes of humans [33] caused by internal and external factors, such as illumination changes, different poses, fast camera motion, and dramatic changes in actions, video-based human action recognition in complex environments still faces significant difficulties and challenges.
During the past few decades, significant progress has been made and many methods have been developed for video-based action recognition [3]–[6]. These methods usually treat video-based action recognition as a supervised learning process, where the relationship between different actions is modelled on the training set, and the learned model is then used to recognize actions in the test set. For example, Subhransu et al. [4] first extracted a set of action-related features, and then classified actions using a Support Vector Machine (SVM). Maji et al. [5] used a trained body detector for human detection, and then employed a non-linear SVM for classification. Yao et al. [6] extracted features by dense sampling, and then utilized random forests to distinguish different actions. The above methods are based on manually-designed features, such as Histogram of Oriented Gradients (HOG) features [7] and Scale-Invariant Feature Transform (SIFT) features [8].
Currently, several methods [9], [10], [28], [30] rely on deep learning to perform video-based action recognition. For example, Gkioxari et al. [9] proposed to use R*CNN for context-based action recognition. Peng and Schmid [10] developed a multi-region two-stream R-CNN method for action detection. Chen et al. [28] proposed a spatiotemporal heterogeneous two-stream network for video action recognition, which employs two different network structures for spatial and temporal information, respectively. Tang et al. [30] proposed a Semantics Preserving Teacher-Student (SPTS) network architecture, applied to the action segmentation task. Despite the excellent performance achieved by deep learning based methods, they often require large-scale training data, and thus the training complexity is relatively high [11]. In this paper, we focus on the small sample size problem (i.e., the training samples are limited), where manually-designed features are exploited.
In general, the features extracted from video sequences are high-dimensional. However, high-dimensional features often contain significant redundant information and noise, which degrades the performance of subsequent classification. Therefore, metric learning (which usually learns a distance/similarity metric) can not only serve as dimensionality reduction, but also improve the recognition rate by matching the similarity between features for video-based action recognition. Specifically, similarity or distance metric learning uses a training sample set to obtain a metric model that effectively reflects the similarity (or distance) between the training sample and the test sample, and thus assigns similar actions to the same class and dissimilar actions to different classes.
The larger the distance between the training sample and the test sample, the more dissimilar the two samples are. For example, Hu et al. [31] developed a sharable and individual multi-view metric learning approach for visual recognition, which jointly learns an optimal combination of multiple distance metrics on multi-view representations. Tran and Sorokin [32] applied the Large Margin Nearest Neighbor (LMNN) algorithm to behavior recognition, using metric learning to obtain a Mahalanobis distance (i.e., the learned matrix in the Mahalanobis distance) to improve the performance of human activity recognition.
The above metric learning methods [31], [32] learn a full-rank matrix, such that under the learned matrix the similarities between same-class samples become higher, while different-class samples are kept far apart.
Nevertheless, learning a full-rank matrix is often inefficient and prone to overfitting. Hence, an effective way to overcome these problems is to impose regularization (such as a sparse or low-rank constraint) on the learned matrix. For example, Shen et al. [12] obtained a low-dimensional distance metric via a low-rank constraint. Fang et al. [13] proposed an adaptive metric learning method with a low-rank constraint. Unfortunately, imposing the commonly-used low-rank constraint (i.e., the trace norm of the matrix) on the optimization problem of metric learning still leads to high computational cost. Recently, Convolutional Neural Network (CNN) based metric learning methods have also been developed [14], [15]. For example, Song et al. [14] proposed a structured clustering loss to train a CNN. Wang et al. [15] developed a novel angular loss for deep metric learning, which greatly improves the traditional triplet loss by imposing geometric constraints on triplets. However, these methods suffer from high computational complexity.
In this paper, we propose a novel Cholesky decomposition based metric learning (CDML) method for effective video-based human action recognition. The similarity matrix is learned by taking advantage of Cholesky decomposition. Compared with the traditional low-rank metric learning methods [12], [13], the proposed method achieves the low-rank property of the similarity matrix by limiting the rank of the lower triangular matrix. Therefore, the proposed method is able to simultaneously perform dimensionality reduction and metric learning, and the optimization problem can be solved easily and efficiently via gradient descent. Experimental results show that our method achieves competitive performance compared with the state-of-the-art methods on a challenging video-based human action recognition dataset.
The whole workflow of our algorithm is shown in Figure 1. For each video sequence, the improved Dense Trajectories technique [16] is first used for feature detection. For each trajectory, we select three effective feature descriptors for feature extraction. Next, we use whitened principal component analysis (WPCA) to reduce the dimensionality and remove noise, and VLAD is employed for feature encoding. Considering that VLAD features have high dimensionality, we further propose to learn a similarity metric to match sample pairs based on Cholesky decomposition.
The main contributions of this paper are summarized as follows:
1) The similarity matrix is learned by taking advantage of Cholesky decomposition, which factorizes the matrix into the product of a lower triangular matrix and its transpose.
2) The proposed method achieves the low-rank property of the similarity matrix by limiting the rank of the lower triangular matrix.
3) Our method is able to simultaneously perform dimensionality reduction and metric learning, and the optimization problem can be solved easily and efficiently via gradient descent.
4) Multiple feature descriptors are combined, and the improved dense trajectories technique [16], which removes background trajectories and distorted optical flow, is used as the feature detector for human action recognition.
This paper is organized as follows. First, we describe the feature extraction for action recognition. Second, the metric learning algorithm based on Cholesky decomposition is proposed. We then give the experimental results and analysis on a public video dataset. Finally, the conclusion is drawn.

II. FEATURE EXTRACTION OF ACTION RECOGNITION
In this section, we mainly discuss the feature extraction methods for action recognition. As shown in Figure 1, feature extraction is mainly divided into three parts: original low-level feature extraction, feature preprocessing, and feature encoding.

A. THE ORIGINAL LOW-LEVEL FEATURE EXTRACTION
In order to extract low-level local features, two steps are generally involved: feature detection and feature description. Feature detectors mainly include 3D Harris corner points [17], Space-Time Interest Points [18], Dense Trajectories [19], Improved Dense Trajectories [16], etc. The first two detectors mainly select local regions that are advantageous for subsequent recognition and classification by optimizing loss functions. The latter two mainly obtain a large number of points of interest by dense sampling in each frame of the video sequence, and track the detected points of interest to form dense trajectories.
The dense trajectories method [19] is a classical action recognition algorithm, which involves dense feature point sampling, feature point tracking and trajectory-based feature extraction. Dense trajectories are formed from points of interest densely sampled in each frame of the video sequence, generally at each scale of a multi-scale space (usually 8 scales). The usual scheme is first to sample an image at W points and track each of these points separately on each scale. Assuming that a sampled point in the t-th frame is P_t = (x_t, y_t), its position in the (t+1)-th frame is obtained by median filtering in the dense optical flow field ω_t:

P_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M * ω_t)|_(x̄_t, ȳ_t),

where M is the median filter kernel; (x̄_t, ȳ_t) represents the rounded position of (x_t, y_t); and '*' denotes the convolution operation. Once the dense optical flow field is computed, the points in the region can be tracked further to form a trajectory, namely (P_t, P_{t+1}, P_{t+2}, . . .). In video-based point tracking, drifting often occurs. To avoid this problem, the length of a trajectory is fixed to K frames: if a tracked trajectory is longer than K frames, the feature point is removed. Wang and Schmid [16] further enhanced the performance of dense trajectories by estimating camera motion, known as the improved dense trajectories. This method uses a robust homography to estimate camera motion, and thereby removes background trajectories and distorted optical flow. In this paper, the improved Dense Trajectories technique is used for feature detection.
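As an illustration, one tracking step of the scheme above can be sketched as follows. This is an illustrative sketch, not the authors' code: the 3×3 median filter size and the (H, W, 2) flow array layout are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def track_point(flow, pt):
    """One tracking step of dense trajectories (sketch).

    flow: (H, W, 2) dense optical flow field omega_t (assumed given).
    pt:   (x_t, y_t) sub-pixel point position.
    Returns P_{t+1} = P_t + (M * omega_t) evaluated at the rounded
    position, where M is a 3x3 median filter kernel.
    """
    # Median-filter each flow channel (the kernel M in the text)
    smoothed = np.stack(
        [median_filter(flow[..., c], size=3) for c in range(2)], axis=-1)
    x, y = pt
    xi, yi = int(round(x)), int(round(y))   # rounded position (x̄_t, ȳ_t)
    dx, dy = smoothed[yi, xi]               # filtered displacement
    return (x + dx, y + dy)
```

Repeating this step frame by frame yields the trajectory (P_t, P_{t+1}, P_{t+2}, . . .), truncated at K frames.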
In video-based action recognition algorithms, common feature descriptors include Histograms of Oriented Gradients (HOG) [7], Histograms of Optical Flow (HOF) [20] and Motion Boundary Histograms (MBH) [21]. Compared with other feature descriptors, HOG maintains good invariance to geometric and photometric deformations of the image. Moreover, under coarse spatial sampling, fine orientation sampling and strong local photometric normalization, as long as a pedestrian maintains an upright posture, HOG tolerates some subtle body movements without affecting the detection result. Therefore, the HOG descriptor is particularly suitable for human detection in images. HOG describes the target's appearance, characterized by histograms of gradient orientations in local regions of the image. The basic steps of the HOG descriptor are: first, the target image is divided into several cells and the gradient histogram of each cell is calculated. Then, multiple cells are grouped into blocks, and the gradient histograms of all the cells in one block are concatenated to form the feature of the block. Finally, the features of the different blocks are concatenated to obtain the final HOG descriptor.
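To make the first step concrete, here is a minimal sketch of the per-cell gradient histograms underlying HOG. Block grouping and normalization are omitted, and the cell size and bin count are illustrative choices, not the 96-dimensional trajectory-aligned configuration used later in the paper.

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Per-cell orientation histograms (the first step of HOG, sketch)."""
    gy, gx = np.gradient(img.astype(float))          # image gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180       # unsigned orientation
    h, w = img.shape
    hists = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a / (180 / bins)).astype(int), bins - 1)
            np.add.at(hists[i, j], idx, m)           # magnitude-weighted vote
    return hists.reshape(-1)                         # concatenated descriptor
```

In the full descriptor, neighbouring cells would additionally be grouped into blocks and each block L2-normalized before concatenation.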
Because the size of the target changes with time, the dimensions of the corresponding optical flow descriptors would also change. At the same time, the calculation of optical flow is sensitive to background noise, scale changes and motion direction. Therefore, HOF was proposed based on optical flow; it not only represents temporal motion information, but is also insensitive to scale and motion direction. HOF mainly constructs a histogram of the optical flow. Specifically, the optical flow of the input image in the x and y directions is calculated first. Then, for each region, the magnitude and orientation of the flow are computed from the per-pixel flow components in the x and y directions. Finally, according to the orientation of the flow, the histogram is built by accumulating the flow magnitudes.
MBH mainly describes motion information, creating histograms from the motion boundaries of the optical flow in the x and y directions. The MBH descriptor extracts the boundary information of moving objects (hence the name motion boundary histograms), which works well for pedestrian detection. In addition, its computation is simple and efficient. Among the three descriptors, HOG is computed in the image domain and is a spatial descriptor, while HOF and MBH are computed on optical flow images and are temporal descriptors. HOF describes the distribution of the optical flow, while MBH mainly describes motion boundaries. Therefore, in this paper, we combine the above three feature descriptors to achieve better performance for action recognition.

B. FEATURE PREPROCESSING
In video-based action recognition, feature detectors and descriptors can be used to perform effective low-level feature extraction on video sequences. However, the features obtained by low-level extraction are high-dimensional, highly correlated and subject to noise. For example, in a video sequence of only ten seconds, thousands or even tens of thousands of trajectories are detected by the dense trajectories algorithm, and each trajectory is usually represented by a high-dimensional feature vector, so the dimensionality of the original low-level features is very high and there is noise between different trajectories. Therefore, we need to pre-process the original low-level features in order to reduce the dimensionality and remove the influence of noise on the subsequent action recognition. In this paper, we use Whitened Principal Component Analysis (WPCA) for dimensionality reduction and noise removal.
Principal component analysis uses the training dataset to obtain a linear transformation (corresponding to a projection matrix), and selects a few important projection directions (corresponding to projection vectors) that characterize the principal directions, so that the data are projected onto the principal components to obtain a new compressed feature vector. When the feature dimensionality of the data is high and correlations between different features exist, principal component analysis can make the features uncorrelated in the new projection space, and thus both dimensionality reduction and denoising are achieved by preserving the principal components. The specific steps of WPCA are as follows. Assume that the training data are X = (x_1, x_2, . . . , x_n), where x_i ∈ R^d and n > d, and that the data have been normalized to zero mean. WPCA uses a projection matrix to project X into a p-dimensional (p ≪ d) feature space.
Step 1: Calculate the covariance matrix of the training data, Σ = (1/n) Σ_{i=1}^{n} x_i x_i^T, where Σ is the covariance matrix of the training data and n is the number of training samples.
Step 2: The eigenvalues of the covariance matrix and their corresponding eigenvectors are calculated. The eigenvalues are sorted in descending order to obtain the eigenvalue set {λ_1, λ_2, . . . , λ_d} and the corresponding eigenvector set {v_1, v_2, . . . , v_d}, where d is the total number of eigenvectors.
Step 3: The eigenvectors corresponding to the largest and second largest eigenvalues are removed, and the eigenvectors corresponding to the next p eigenvalues Λ = {λ_3, λ_4, . . . , λ_{p+2}} form the linear transformation matrix V = [v_3, v_4, . . . , v_{p+2}]. In general, the first few dimensions of WPCA correspond to dramatic changes in appearance (such as lighting, pose, and other external factors). Therefore, this paper removes the first two eigenvectors of WPCA to overcome the influence of external factors and improve the performance of the subsequent action recognition.
Step 4: For a test sample y (y ∈ R^d), after feature projection, the dimension-reduced feature vector is ŷ = diag(Λ)^{-1/2} V^T y. Here, V is the transformation matrix, and diag(Λ) is the diagonal matrix whose diagonal elements are the eigenvalues in Λ = {λ_3, . . . , λ_{p+2}}.
The only difference from PCA is that WPCA uses the eigenvalues to whiten the projected features in the final step. The advantage is that in the new feature space the different features are normalized, ensuring consistent scales across features.
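Under the assumption of zero-mean data, the four steps can be sketched as follows. This is illustrative only; dropping the first two components follows Step 3, and the small `eps` guards against division by zero.

```python
import numpy as np

def wpca(X, p, drop=2, eps=1e-10):
    """Whitened PCA projection matrix (sketch of Steps 1-4 above).

    X: (n, d) zero-mean training data. Discards the first `drop`
    eigenvectors, keeps the next p, and whitens by the inverse square
    roots of the eigenvalues. Project a sample y with y @ W.
    """
    cov = X.T @ X / X.shape[0]                     # covariance (Step 1)
    vals, vecs = np.linalg.eigh(cov)               # eigh returns ascending order
    order = np.argsort(vals)[::-1]                 # sort descending (Step 2)
    vals, vecs = vals[order], vecs[:, order]
    sel = slice(drop, drop + p)                    # drop leading components (Step 3)
    return vecs[:, sel] / np.sqrt(vals[sel] + eps) # whitening (Step 4)
```

After projection, the sample covariance of the retained components is (approximately) the identity, which is exactly the scale-consistency property described above.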

C. FEATURE ENCODING
Feature encoding is often based on the Fisher kernel representation [22], [23] of local features. The Bag of Words (BOW) model is the zero-order moment of the Fisher kernel, while the VLAD (Vector of Locally Aggregated Descriptors) feature is its first-order moment. For the original low-level features extracted from different video sequences, the numbers of feature vectors obtained are not equal: they depend on the number of detected points of interest and the number of frames in the video. A global representation of uniform length is therefore needed for the low-level features of each video, and VLAD is employed for feature encoding in this paper. The feature encoding process using VLAD, based on the Fisher kernel representation, is as follows.
Step 1: Detect the points of interest and extract the local features as described in Subsections A and B, and reduce their dimensionality using WPCA to obtain the low-level features.
Step 2: The low-level features are extracted from all the data in the training set, and the codebook is obtained by K-means clustering. The points of interest extracted from a video sequence are then assigned to the nearest cluster centers according to nearest neighbor classification.
Step 3: Calculate the differences (residuals) between the local feature vectors of the points of interest and their cluster centers, accumulate them into the corresponding histogram, and finally obtain a feature vector of uniform dimension.
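Steps 2–3 can be sketched as follows, assuming the codebook has already been learned by K-means on the training set; intra-normalization and power-normalization variants are omitted.

```python
import numpy as np

def vlad_encode(feats, centers):
    """VLAD encoding of one video's local descriptors (sketch).

    feats:   (m, p) local descriptors from one video.
    centers: (k, p) K-means codebook from the training set.
    Returns a fixed-length (k*p,) vector: per-center sums of residuals,
    L2-normalized.
    """
    # Assign each descriptor to its nearest codebook center (Step 2)
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)
    k, p = centers.shape
    v = np.zeros((k, p))
    for c in range(k):
        members = feats[assign == c]
        if len(members):
            v[c] = (members - centers[c]).sum(0)   # residual accumulation (Step 3)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

With k = 256 centers and p = 20-dimensional descriptors, this yields the 5,120-dimensional encoding used in the experiments.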

III. CHOLESKY DECOMPOSITION BASED METRIC LEARNING ALGORITHM
The traditional similarity (or distance) metric methods [31], [32] generally measure the differences between samples by calculating the Euclidean distance, cosine distance or Bhattacharyya distance. However, these methods often treat the different features of a sample equally (without distinguishing the different discriminative abilities among features), and do not take into account the correlations between features. Similarity or distance metric learning refers to using a training sample set to obtain a metric model that effectively reflects the similarity (or distance) between the training sample and the test sample. We use the learned similarity (or distance) metric model to measure the similarity (or distance) between the training sample and the test sample, so that the following requirements are met: the similarity between similar samples becomes larger (or the distance becomes smaller), and the similarity between dissimilar samples becomes smaller (or the distance becomes larger). Typically, such similarity functions can be formulated as s(x, y) = x^T M y, where M is the similarity matrix, and x and y are two feature vectors. Under the learned metric s(x, y), similar samples become closer to each other, while dissimilar samples are pushed far apart.
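When M is factorized as M = UU^T (the factorization this section goes on to propose), the bilinear similarity reduces to an inner product between projected vectors, which is why learning U performs metric learning and dimensionality reduction at once. A small sketch:

```python
import numpy as np

def sim(x, y, U):
    """s(x, y) = x^T M y with M = U U^T.

    Equivalent to the inner product of the projected vectors U^T x and
    U^T y, so the factor U doubles as a dimensionality-reducing map.
    """
    return float((U.T @ x) @ (U.T @ y))
```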
In this paper, image pairs (i.e., similar pairs and dissimilar pairs) are considered for metric learning. To take advantage of the low-rank constraint on the similarity matrix, we first design the following objective function of metric learning:

min_M (1/c) Σ_{i,j} l_β(−y_ij x_i^T M x_j), s.t. rank(M) ≤ d,

where x_i and x_j represent the two samples in an image pair, and y_ij indicates whether the two samples in the pair are in the same class: y_ij = 1 indicates that the two samples belong to the same class, and y_ij = −1 otherwise. c is the total number of sample pairs. l_β(z) is a generalized logistic loss function, l_β(z) = (1/β) log(1 + e^{βz}), where β is the regularization parameter. M is the similarity matrix to be learned. The constraint rank(M) ≤ d guarantees that the similarity matrix is low-rank, where d is a constant constraining the rank. The low-rank constraint yields a typical non-convex optimization problem (an NP-hard problem), which is usually relaxed via trace norm regularization, so that the original non-convex problem (minimizing the rank of the similarity matrix) becomes a convex one (minimizing the trace norm of the similarity matrix). However, minimizing the trace norm typically requires iterative singular value decompositions, which is not only time-consuming but also prone to overfitting (since the solution of the convex problem is only an approximation). This paper instead uses the Cholesky decomposition, which factorizes the similarity matrix M into the product of a lower triangular matrix U and its transpose U^T (i.e., M = UU^T). Based on this factorization, we obtain the following objective function:

min_U (1/c) Σ_{i,j} l_β(−y_ij x_i^T U U^T x_j), s.t. rank(U) ≤ d.
In order to constrain the rank of the matrix U, we directly set to zero those elements whose column index exceeds d in the lower triangular matrix (since the majority of the informative elements lie in the columns with index no larger than d). In this way, we do not need to rely on the expensive singular value decomposition to constrain the rank, which effectively improves the computational efficiency.
In particular, the above objective function can be solved directly by gradient descent. The solution is updated as follows:

U_{t+1} = U_t − γ ∇U,

where t is the iteration index and γ is the step-size parameter of the gradient descent. ∇U represents the gradient, which can be calculated as

∇U = −(1/c) Σ_{i,j} y_ij σ(−β y_ij x_i^T U U^T x_j) (x_i x_j^T + x_j x_i^T) U,

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function. In order to constrain the rank of U, in each iteration of the gradient descent we directly set to zero all elements of the lower triangular matrix whose column index exceeds d.
The specific steps of the CDML method are shown in Algorithm 1.

Algorithm 1 Cholesky Decomposition Based Metric Learning (CDML)
Input: data matrix X; the similar pairs S and the dissimilar pairs D; gradient descent parameter γ;
Output: mapping matrix U;
1: Initialize U_0 by applying WPCA to X;
2: Set t = 0;
3: Repeat:
4: Calculate the gradient ∇U;
5: Update U_{t+1} = U_t − γ ∇U;
6: Set those elements whose column index exceeds d to zero in the lower triangular matrix;
7: t = t + 1;
8: Calculate the loss function based on (3);
9: Until t reaches the maximum iteration number or the convergence condition is satisfied;
10: Return U.
It should be pointed out that, unlike the hinge loss function used in traditional metric learning algorithms, the objective function (as shown in (5)) of the proposed similarity metric learning algorithm adopts a smoother generalized logistic loss function l_β(z). This loss is differentiable everywhere, so the optimization problem is easier to solve by gradient descent. At the same time, this paper proposes a new similarity metric learning algorithm based on the Cholesky decomposition of the metric matrix, which converts the metric matrix into the product of a lower triangular matrix U and its transpose U^T, so that the low-rank property of the metric matrix can be achieved by controlling the rank of U.
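Putting Algorithm 1 together, a compact sketch follows. It assumes the pairwise loss form l_β(−y_ij x_i^T U U^T x_j) with y_ij ∈ {+1, −1}, and uses an identity initialization in place of the WPCA step; parameter values are illustrative, not the paper's settings.

```python
import numpy as np

def cdml_fit(X, pairs, labels, d, gamma=0.05, beta=2.0, iters=100):
    """CDML by gradient descent with column truncation (sketch).

    X:      (n, p) sample matrix.
    pairs:  list of (i, j) index pairs.
    labels: y_ij = +1 for similar pairs, -1 for dissimilar.
    d:      rank bound; columns of U beyond the d-th are zeroed (step 6).
    """
    n, p = X.shape
    U = np.eye(p)                  # stand-in for the WPCA init of step 1
    c = len(pairs)
    for _ in range(iters):
        grad = np.zeros_like(U)
        for (i, j), y in zip(pairs, labels):
            s = X[i] @ U @ U.T @ X[j]
            w = 1.0 / (1.0 + np.exp(beta * y * s))    # sigmoid(-beta*y*s)
            grad += -y * w * (np.outer(X[i], X[j]) + np.outer(X[j], X[i])) @ U
        U -= gamma * grad / c      # gradient step (steps 4-5)
        U[:, d:] = 0               # enforce rank(U) <= d (step 6)
    return U
```

On toy data with two clusters, the learned U scores similar pairs above dissimilar ones while keeping only d active columns, which is exactly the simultaneous metric learning and dimensionality reduction described above.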

IV. EXPERIMENTS

A. EXPERIMENTAL SETUP
The experiments are conducted on the ASLAN dataset [24], which is collected in complex natural scenes. It contains 3,697 different videos collected from YouTube, with a total of 432 action categories. The details of the ASLAN dataset are shown in Table 1. Both the training set and the test set use similar pairs and dissimilar pairs. Here, a similar pair refers to two videos whose actions are classified as the same type of action or behavior, while a dissimilar pair contains actions classified as different types. Figure 2 shows some of the similar and dissimilar pairs in the ASLAN dataset. The task of video action recognition in this paper is therefore to determine whether the actions occurring in two video samples are of the same type.
The performance of video-based action recognition is evaluated by video verification (deciding whether two videos present the same action or not). The verification rate measures the ratio of the number of correctly classified pairs to the total number of pairs. For the ASLAN dataset, we first divide all video sequences into 10 subsets, where each subset contains 350 similar pairs and 350 dissimilar pairs. Then, the performance is evaluated by 10-fold cross validation: 9 subsets are used as the training set, and the remaining one is used as the test set. Finally, the mean and variance of the 10 verification rates are taken as the final results.
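The verification rate used above can be computed as follows (a sketch; in practice the decision threshold would be selected on the training folds):

```python
import numpy as np

def verification_rate(scores, labels, threshold):
    """Pair-verification accuracy: fraction of pairs whose thresholded
    similarity score matches the same/different label (+1 / -1)."""
    pred = np.where(np.asarray(scores) >= threshold, 1, -1)
    return float(np.mean(pred == np.asarray(labels)))
```

Averaging this quantity over the 10 test folds gives the mean verification rate reported in the experiments.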
The methods used for performance comparison include the Dense Trajectories (DT) method [19], the Low-Rank Metric Learning (LRML) method [12], the Adaptive Metric Learning (ARL) method [13], the Low-Rank Geometric Mean Metric Learning (LR-GMML) method [25], the Multi-view Discriminative Analysis of Canonical Correlations (MDACC) method [26], and the multimodal hybrid centroid canonical correlation analysis (MHCCCA) method [27]. The DT, MDACC and MHCCCA methods are popular and representative video-based action recognition methods. The LRML, LR-GMML and ARL methods are effective similarity (or distance) metric learning methods based on the low-rank constraint. For all the competing methods, we use the same features as the proposed method for metric learning. In addition, we use the traditional Euclidean distance and Cosine distance as baselines for comparison.
For the algorithm proposed in this paper, we use the improved dense trajectories technique [16] as the feature detector in the experiments. First, for each trajectory, we select three feature descriptors, i.e., HOG, HOF and MBH. The HOG descriptor is 96-dimensional, the HOF descriptor is 108-dimensional, and the MBH descriptor is 192-dimensional, so low-level local features of 396 dimensions are obtained. Next, WPCA is used to reduce and denoise the low-level local features, and the 396 dimensions are reduced to a 20-dimensional feature space. Then, VLAD is used for encoding, where the number of codebook (cluster) centers is 256, so the features obtained after VLAD encoding are 5,120-dimensional. Finally, the similarity metric learning algorithm based on Cholesky decomposition is used to learn the metric. All the competing methods are implemented in MATLAB on a machine with an Intel 6700K 4.0 GHz CPU and 8.00 GB RAM.

B. EXPERIMENTAL RESULTS AND ANALYSIS
In the experiments, we use both single feature descriptors and the combined feature descriptor for feature extraction. The experimental results using single feature descriptors are shown in Table 2. In addition, the results obtained by the different competing methods using the combined feature descriptor are shown in Figure 3.
From Table 2, we can see that, based on a single feature descriptor, the similarity (or distance) learning methods improve the recognition rate over the Cosine and Euclidean distances (by about 3% to 7%), which shows the effectiveness and importance of metric learning. The results obtained by the Cosine distance are slightly better than those obtained by the Euclidean distance, because the Cosine distance measures the change in angle (removing the effect of scale on the feature vectors). The proposed method obtains about 7% improvement in verification rate for each feature descriptor compared with the Cosine distance. Compared with LRML, ARL and LR-GMML, our CDML method obtains 2%–5% improvements in performance. Moreover, the proposed CDML method requires less training time than all the other competing methods except DT, while obtaining higher video verification rates than DT. Therefore, our method can effectively learn the similarity between different video actions, since the low-rank matrix based on Cholesky decomposition not only effectively achieves dimensionality reduction, but also captures discriminative information that can be used to distinguish different video actions.

Figure 3 shows the video verification rates obtained by the different methods with the combined feature descriptor (i.e., HOG+HOF+MBH). It can be seen that metric learning effectively improves performance in video-based action recognition. Compared with a single feature descriptor, the combined descriptor further improves the performance of all methods. As shown in Table 2, with the single MBH descriptor our method achieves the highest mean verification rate, 60.59%; however, this is lower than the 63.3% verification rate obtained by the proposed algorithm with the combined feature descriptor.
This is mainly because the three feature descriptors describe the information of the video sequence from different aspects: HOG mainly describes the gradient information in local areas of the image, HOF describes the optical flow, and MBH characterizes the motion information of the foreground target. Therefore, the proposed method combines the advantages of the HOG, HOF and MBH descriptors, and the effectiveness of the combined descriptor is verified. In addition, compared with the two representative video-based action recognition methods (i.e., MDACC and MHCCCA), the CDML method obtains better results, owing to the effective metric learning based on Cholesky decomposition. In particular, compared with the other three metric learning methods, i.e., LRML, ARL and LR-GMML, the proposed method can more effectively learn and characterize the changes of different video sequences in complex environments.
According to the results reported in Table 2 and Figure 3, the proposed algorithm can recognize the most challenging actions, and video sequences can be matched using the Cholesky decomposition based metric learning method. In particular, compared with the LRML and ARL algorithms, the proposed similarity metric learning algorithm can more effectively distinguish the changes of actions in different video sequences.

V. CONCLUSION
This paper proposes a new metric learning method, called CDML, based on Cholesky decomposition for video-based human action recognition. Different from traditional low-rank metric learning methods, our method takes advantage of Cholesky decomposition to factorize the similarity matrix into the product of a lower triangular matrix and its transpose. In this way, the low-rank constraint on the matrix can be naturally achieved by controlling the rank of the lower triangular matrix. Hence, the proposed method avoids the repeated singular value decompositions required by traditional methods, while the learned matrix based on Cholesky decomposition not only projects the samples from the high-dimensional feature space onto a low-dimensional feature space, but also makes the similarity between similar samples larger and the similarity between dissimilar samples smaller. Experimental results show the effectiveness and efficiency of the proposed method. In the future, we will further evaluate the proposed algorithm in real applications.