Active Video Hashing via Structure Information Learning for Activity Analysis

When analyzing and understanding large-scale video datasets, selecting training videos via active learning can significantly reduce the annotation cost of supervised learning without sacrificing classifier accuracy. To further reduce the computational overhead of exhaustive comparisons between high-dimensional low-level visual features, we develop a novel video hash coding model within an active learning framework. The model optimizes mean average precision directly by explicitly considering the structure information in video clips when learning the optimal hash functions. The structure information includes the temporal consistency between successive frames and the local visual patterns shared by videos with the same semantic labels. Rather than relying on the similarity between paired videos, we use a ranking-based loss function to optimize mean average precision directly. Combined with the active learning component, we jointly evaluate the unlabeled training videos according to their uncertainty and average precision values with an efficient algorithm based on the structured SVM. Extensive experimental results on several benchmark datasets demonstrate that our approach produces significantly higher search accuracy than traditional query refinement schemes.


I. INTRODUCTION
Due to the popularity of digital cameras and the emergence of social media websites such as YouTube, Facebook and Youku, the number of videos accessible to everyone has increased exponentially over the past decade. Facing such a proliferation of videos, efficiently analyzing and understanding video content at large scale has become a significant challenge in computer vision [1]-[3]. In machine learning, training videos with high-quality annotations clearly lead to better performance. However, labeling videos at large scale is usually tedious and time consuming; in particular, accurate and reliable annotations in some specific areas must be given by experts with the corresponding background knowledge. These issues severely limit the application of supervised learning in practice, where a large number of videos are collected without labels.
The associate editor coordinating the review of this manuscript and approving it for publication was Nuno Garcia.
In the past decades, although some studies have attempted to address these issues through semi-supervised learning, to the best of our knowledge, such approaches typically assume all unlabeled training videos are equally informative and input all of them at once [4]. In fact, on one hand, substantial unlabeled data might introduce large uncertainties and potential threats [5], [6]; on the other hand, arbitrarily chosen unlabeled training videos may even be redundant [7]. Active learning, as a complement to supervised learning, adaptively identifies the informative videos for which labels should be requested. As a result, this technique reduces the annotation cost without sacrificing accuracy. Recent studies have demonstrated the effectiveness and efficiency of active learning in many areas, such as machine learning, data mining, information retrieval and computer vision [8]. The challenge of active learning for video understanding is how to decide whether a video is ''informative'' or ''useful'' [9]-[14]. From this perspective, there are two criteria in the literature for selecting training videos to be annotated, i.e., representativeness sampling and uncertainty sampling. For example, Tong and Chang [15] proposed the support vector machine (SVM) active learning algorithm to conduct relevance feedback accurately and quickly; in this algorithm, the informative training data are defined as the samples close to the classification boundary of the SVM classifier. In later work, Wang et al. [16] incorporated the unlabeled training data into a bootstrapping procedure. Besides the uncertainty information, Brinker [17] incorporated a diversity measure into active learning.
Another significant difficulty in large-scale video analysis stems from the high-dimensional low-level visual feature representation of videos with complex contents, which usually suffers from the curse of dimensionality. Besides the semantic gap between low-level features and high-level concepts, content-based retrieval through exhaustive comparisons of low-level visual features is computationally costly and practically prohibitive when handling a large collection of video clips [18]-[20]. To address this issue, video hashing has attracted considerable attention from researchers and practitioners because of its robustness, security and computational efficiency over millions or even billions of video clips [21]. In the past decades, a variety of hashing techniques have been explored [22]. Some researchers propose to encode videos into sets of compact hash codes by incorporating various machine learning techniques while preserving the underlying geometric structure of the original data, such as locality sensitive hashing (LSH) [23], self-taught hashing [24], spectral hashing [25], and so on. However, some existing video hashing approaches directly employ still-image hashing techniques on each frame of a video and then concatenate these frame-level hash codes into a high-dimensional binary vector [26], [27]. This strategy is computationally costly due to both the high-dimensional hash codes and the large number of frames in long videos [28]. Although some studies select only one key frame to represent a video [29], this is sensitive to frame dropping and noise [28]. In fact, a video consists of multiple frames with dynamic backgrounds and evolving human activities [30], and thus the temporal information plays a significant role in understanding video content. Making use of this information is beneficial for achieving better performance.
To take the temporal evolution of video content into consideration, some efforts have been devoted to exploiting the correlations between sequences of frames. For example, in the field of robust video hashing, Coskun et al. [31] relied on the low-pass coefficients of 3D cosine-based transforms and represented a video segment as a three-dimensional matrix; Esmaeili and Ward [28] divided the sequence into video segments and then generated a temporally informative representative image of each segment by linearly combining the frames in the segment. In the field of content-based video retrieval, Ye et al. [18] explicitly took the structure information in video clips into account. This method achieves promising results on some benchmark datasets; however, it learns the optimal hash functions based on the similarity (distance) between pairs of training videos and does not optimize the ultimate retrieval performance directly. Additionally, the hash coding methods proposed in [18], [32] are conducted in the framework of supervised learning, which might not be desirable for many applications where a large number of videos are collected without labels.
In this paper, we build on the framework of active learning and develop a novel video hash coding model to optimize the average precision (AP) directly. The model explicitly takes the structure information of video clips into consideration, including the temporal consistency between successive frames together with the local visual patterns shared by videos with the same semantic labels. We learn the hash functions with a ranking-based loss function that maximizes mean average precision (MAP). Note that the framework of AP optimization preserves the order of the relevant videos rather than the purely absolute similarity between each pair of videos. Instead of fully supervised learning, we embed the proposed video hash coding with structure information into an active learning framework to mitigate the tedious work of manually annotating large-scale video collections. Specifically, our algorithm starts with a small number of seed labeled videos, followed by learning linear hash functions with the structure information of video clips. Based on the current hash functions, some training videos are selected to be annotated by evaluating the uncertainty and AP of the candidate videos in the active pool, where the AP-related criterion ensures that the winner videos selected for labeling increase the AP performance.
The contributions of this new model lie in three aspects: 1) considering the structure information of video clips, we learn the linear hash functions by optimizing the average precision (AP) directly; 2) we embed the proposed video hash coding into the framework of active learning, where unlabeled video candidates are evaluated according to their uncertainty and AP values; 3) an efficient algorithm based on the structured SVM is developed to solve the resulting challenging problems. Extensive experimental results demonstrate the effectiveness and efficiency of this algorithm.

II. AP OPTIMIZATION WITH VIDEO HASHING
In this section, we first introduce the frame-level hash coding based on linear embedding and define the similarity measurement of videos in the binary subspace. Then, building on the structured SVM, we develop a general framework of AP optimization with video hashing.

A. HASH CODING OF VIDEO
We suppose that the training set consisting of $N$ videos is denoted by $\mathcal{X} = \{X_i\}_{i=1}^{N}$, where video $X_i$ consists of $n_i$ frames $x_t^i \in \mathbb{R}^d$ for $i = 1, 2, \cdots, N$. Given any video frame $x_t^i$, we focus on learning its $K$-bit binary code $h(x_t^i) = [h_1(x_t^i), \cdots, h_K(x_t^i)]$ with $h_k(x_t^i) = \mathrm{sign}(w_k^\top x_t^i + b_k)$ for $k = 1, 2, \cdots, K$, where $w_k \in \mathbb{R}^d$ refers to a hashing (classification/decision) hyperplane for the $k$-th bit of the code. For a compact representation, we collect all $w_k$ ($k = 1, 2, \cdots, K$) into the matrix $W = [w_1, w_2, \cdots, w_K] \in \mathbb{R}^{d \times K}$, namely the transformation matrix for hash coding. $b_k$ is the intercept, which equals $0$ if the frames in the training dataset are zero mean. The element-wise sign function $\mathrm{sign}(\cdot)$ is used in this paper for its simplicity and efficiency.
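As a minimal sketch of the frame-level coding described above (in Python/NumPy; the function name `hash_frames` and the tie-breaking convention for frames lying exactly on a hyperplane are our own assumptions):

```python
import numpy as np

def hash_frames(X, W):
    """Encode each d-dimensional frame (a column of X) into a K-bit
    binary code in {-1, +1}^K via h(x) = sign(W^T x).

    X : (d, n) matrix of n frames; W : (d, K) hashing hyperplanes.
    Returns a (K, n) matrix of codes, one column per frame."""
    H = np.sign(W.T @ X)
    H[H == 0] = 1  # assumed convention: break ties on the hyperplane as +1
    return H
```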
Following an effective metric methodology in [33]-[35], we define the similarity of hash codes between a query video $Q \in \mathcal{Q}$ with $m$ frames $q_t$ and the videos in training set $\mathcal{X}$ through the function
$$S(Q, X_i) = \frac{1}{m\, n_i} \sum_{t=1}^{m} \sum_{s=1}^{n_i} s\big(h(q_t), h(x_s^i)\big), \qquad (1)$$
where the function $s: \{-1,+1\}^K \times \{-1,+1\}^K \to [-1, 1]$ measures the cosine similarity between video frames by
$$s\big(h(q_t), h(x_s^i)\big) = \frac{1}{K}\, h(q_t)^\top h(x_s^i). \qquad (2)$$
This definition states that the cosine similarity of hash codes between a pair of videos equals the average cosine similarity over all pairs of frames, so that most of the frames in a video play significant roles in measuring the similarity between videos. However, the similarity measurement defined above is not tractable, because the discrete nature of the sign function in each linear hash function makes it non-differentiable and hard to optimize. Intuitively, we relax Eq. (2) by replacing the sign function with its signed magnitude. In this way, similar videos not only share the same signs but also have large projection magnitudes, while dissimilar videos have not only different signs but also magnitudes that are as far apart as possible. Therefore, Eq. (2) can be reformulated as
$$S(Q, X_i) = \big\langle W W^\top,\, \bar{S}(Q, X_i) \big\rangle_F, \qquad (3)$$
where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product and the scatter matrix $\bar{S}(Q, X_i) \in \mathbb{R}^{d \times d}$ is defined as
$$\bar{S}(Q, X_i) = \frac{1}{K\, m\, n_i} \sum_{t=1}^{m} \sum_{s=1}^{n_i} x_s^i\, q_t^\top. \qquad (4)$$
Note that sorting $S(Q, X_i)$ ($i = 1, 2, \cdots, N$) in descending order of cosine similarity is equivalent to sorting the Hamming distances between the hash codes of query video $Q$ and the videos in training set $\mathcal{X}$ in ascending order; in fact, the cosine similarity has been proved to generate identical results to the Hamming distance [33].
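The relaxed similarity and its scatter-matrix form can be sketched as follows (a NumPy illustration under our reconstruction of the formulas; the function names and the $1/(Kmn_i)$ normalization are assumptions):

```python
import numpy as np

def code_similarity(Q, X, W):
    """Average cosine similarity between the K-bit sign codes of every
    frame pair drawn from query video Q (d, m) and video X (d, n)."""
    K = W.shape[1]
    HQ = np.sign(W.T @ Q); HQ[HQ == 0] = 1
    HX = np.sign(W.T @ X); HX[HX == 0] = 1
    m, n = Q.shape[1], X.shape[1]
    return float(np.sum(HQ.T @ HX)) / (K * m * n)

def relaxed_similarity(Q, X, W):
    """Relaxation of the code similarity: replace sign(.) by the signed
    magnitude, giving <W W^T, S_bar>_F with the scatter matrix
    S_bar = (1/(K m n)) * sum_t sum_s x_s q_t^T."""
    K = W.shape[1]
    m, n = Q.shape[1], X.shape[1]
    # sum_s x_s and sum_t q_t computed as matrix-vector products
    S_bar = (X @ np.ones((n, 1))) @ (Q @ np.ones((m, 1))).T / (K * m * n)
    return float(np.sum((W @ W.T) * S_bar))  # Frobenius inner product
```

A quick check that the scatter-matrix form equals the direct double sum over frame pairs confirms Eq. (3) and Eq. (4) are consistent.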

B. AP OPTIMIZATION BASED ON HASH CODING OF VIDEO
Given a video query $Q \in \mathcal{Q}$, we denote by $\mathcal{X}_Q^+ \subseteq \mathcal{X}$ and $\mathcal{X}_Q^- \subseteq \mathcal{X}$ the subsets of relevant and irrelevant videos in the training dataset $\mathcal{X}$. Video $X_i$ is said to be ranked higher than $X_j$ in a ranking $Y = [y_{ij}]$ if $y_{ij} = +1$, and lower if $y_{ij} = -1$. The ground truth ranking $Y^*$ of the videos in training set $\mathcal{X}$ given video query $Q$ is specified in this paper as
$$y_{ij}^* = \begin{cases} +1, & X_i \in \mathcal{X}_Q^+,\; X_j \in \mathcal{X}_Q^-, \\ -1, & X_i \in \mathcal{X}_Q^-,\; X_j \in \mathcal{X}_Q^+. \end{cases} \qquad (5)$$
Based on the partial order feature representation
$$\psi(Q, Y) = \frac{1}{|\mathcal{X}_Q^+|\,|\mathcal{X}_Q^-|} \sum_{X_i \in \mathcal{X}_Q^+} \sum_{X_j \in \mathcal{X}_Q^-} y_{ij}\, \big[\bar{S}(Q, X_i) - \bar{S}(Q, X_j)\big],$$
we define a discriminant function $F: \mathcal{Q} \times \mathcal{Y} \to \mathbb{R}$ as
$$F(Q, Y) = \big\langle W W^\top,\, \psi(Q, Y) \big\rangle_F. \qquad (6)$$
Given a video query $Q \in \mathcal{Q}$ and the discriminant function (6), the expected ranking $Y \in \mathcal{Y}$ based on hash codes can be derived by solving the following problem:
$$Y = \arg\max_{Y' \in \mathcal{Y}} F(Q, Y'). \qquad (7)$$
As a result, when the hash functions $h_k$ ($k = 1, 2, \cdots, K$) are fixed, the optimal ranking $Y$ of the training videos $X_i \in \mathcal{X}$ ($i = 1, 2, \cdots, N$) given video query $Q$ can be obtained simply by sorting $S(Q, X_i)$ ($i = 1, 2, \cdots, N$) in descending order. This prediction rule is exact because the descending order of $S(Q, X_i)$ is equivalent to the ascending order of the Hamming distances between the hash codes of query video $Q$ and each video $X_i$.
Following the framework of the structured SVM, the ranking problem given video query $Q \in \mathcal{Q}$ is cast as learning linear hash functions $h_k$ ($k = 1, 2, \cdots, K$), which are parameterized by the hashing hyperplane matrix $W \in \mathbb{R}^{d \times K}$. Specifically, for each video query $Q \in \mathcal{Q}$, we first define the disagreement between any ranking $Y \in \mathcal{Y}$ and the ground truth ranking $Y^*$ based on the discriminant function by
$$\delta(Q, Y^*, Y) = F(Q, Y^*) - F(Q, Y). \qquad (8)$$
Intuitively, $\delta(Q, Y^*, Y)$ measures how compatible the current predicted ranking $Y$ is with the ground truth ranking $Y^*$ given video query $Q \in \mathcal{Q}$. The structured SVM is applied to maximize the margins between the ground truth ranking $Y^*$ and all other possible rankings $Y \in \mathcal{Y}$ for each video query $Q \in \mathcal{Q}$; i.e., for any $Q \in \mathcal{Q}$ and $Y \in \mathcal{Y}$, the following inequality holds:
$$\delta(Q, Y^*, Y) \geq \Delta_{AP}(Y^*, Y) - \xi_Q, \qquad (9)$$
where $\xi_Q \geq 0$ is a slack variable and the empirical ranking loss with average precision is defined as
$$\Delta_{AP}(Y^*, Y) = 1 - AP(Y^*, Y), \qquad (10)$$
with the average precision
$$AP(Y) = \frac{1}{|\mathcal{X}_Q^+|} \sum_{k} \mathrm{Prec@}k(Y)\; \mathbf{1}_{X_k \in \mathcal{X}_Q^+}, \qquad (11)$$
where $\mathrm{Prec@}k(Y)$ denotes the percentage of relevant videos among the top $k$ videos in the predicted ranking $Y$, and $\mathbf{1}_{X_k \in \mathcal{X}_Q^+}$ is an indicator function defined as
$$\mathbf{1}_{X_k \in \mathcal{X}_Q^+} = \begin{cases} 1, & X_k \in \mathcal{X}_Q^+, \\ 0, & \text{otherwise.} \end{cases} \qquad (12)$$
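The average precision used in this loss can be computed from a ranked relevance list as follows (a small Python sketch; the function name is hypothetical):

```python
def average_precision(relevance):
    """AP of a ranked list: relevance[k] is True when the item at rank
    k+1 is relevant. AP = (1/|X+|) * sum_k Prec@k * 1[item at rank k relevant]."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # Prec@k evaluated at each relevant position
    return total / hits if hits else 0.0
```

The AP loss of a predicted ranking is then simply `1 - average_precision(relevance)`.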

III. THE PROPOSED METHODOLOGY
In this section, we consider the structure information of videos and encode this information into the similarity measurement of video hash codes. Following this strategy, we propose a novel video hash coding model in the framework of AP optimization and develop an efficient algorithm to solve the resulting challenging problem.

A. VIDEO HASHING WITH STRUCTURE INFORMATION
Recall that we represent each video as a finite number of successive frames with a video-level label; it is therefore imperative to constrain the hash codes of successive frames to be as close to each other as possible [18]. Given the hashing transformation matrix $W$, we concatenate all successive-frame differences of the video training set $\mathcal{X}$ into a matrix $\bar{X} \in \mathbb{R}^{d \times T}$ whose columns are the differences $x_{t+1}^i - x_t^i$ between successive frames. Then the overall difference between each pair of successive frames in training set $\mathcal{X}$ is formulated as a function of the hash transformation matrix $W$, i.e.,
$$T(W) = \sum_{j=1}^{T} \big\| (W^\top \bar{X})_{:,j} \big\|_\infty, \qquad (13)$$
where $(W^\top \bar{X})_{:,j}$ refers to the $j$-th column of matrix $W^\top \bar{X}$ for $j = 1, 2, \cdots, T$, and the $\ell_\infty$-norm of any vector $z = [z_1, z_2, \cdots, z_K]^\top$ is defined as $\|z\|_\infty = \max_k |z_k|$. Intuitively, minimizing the overall difference $T(W)$ based on the $\ell_\infty$-norm enforces the strong constraint that the hash codes of successive frames remain close to each other. Based on the structure information of the videos in training set $\mathcal{X}$, we learn the optimal hashing hyperplane matrix $W \in \mathbb{R}^{d \times K}$ in the framework of the structured SVM by solving the following optimization problem:
$$\min_{W,\, \xi} \;\; \lambda \|W\|_{2,1} + \gamma\, T(W) + \frac{1}{N} \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \delta(X_i, Y_i^*, Y) \geq \Delta_{AP}(Y_i^*, Y) - \xi_i, \;\; \forall\, Y \in \mathcal{Y},\; i = 1, 2, \cdots, N, \qquad (14)$$
where $\lambda > 0$ and $\gamma > 0$ are trade-off parameters, $\xi_i \geq 0$ ($i = 1, 2, \cdots, N$) are slack variables, and $Y_i^* \in \mathcal{Y}$ refers to the ground truth ranking of the videos in training set $\mathcal{X}$ given the $i$-th video query $X_i \in \mathcal{X}$ for $i = 1, 2, \cdots, N$. The $\ell_{2,1}$-norm of a matrix $W = [w_{ij}] \in \mathbb{R}^{d \times K}$ is defined as
$$\|W\|_{2,1} = \sum_{i=1}^{d} \|W_{i,:}\|_2, \qquad (15)$$
where $W_{i,:}$ refers to the $i$-th row of matrix $W$. Thus, the $\ell_{2,1}$-norm-based regularizer with trade-off parameter $\lambda$ aims to learn an efficient hash transformation matrix $W$ in which only a small number of rows are non-zero. In this way, the specific dimensions of the frame features that correspond to the zero rows of $W$ can be ignored for better robustness [36]; at the same time, this strategy makes the learned video frame features lie in local common patterns related to certain categories and convey more discriminative information [18].
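The two structural regularizers can be evaluated directly (a NumPy sketch under the definitions above; function names are our own):

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1} = sum of the l2 norms of the rows of W; encourages
    whole rows (i.e. feature dimensions) to vanish."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def temporal_term(W, Xbar):
    """T(W) = sum_j ||(W^T Xbar)_{:, j}||_inf, where column j of Xbar is
    the difference between one pair of successive frames."""
    P = W.T @ Xbar                        # (K, T) projected frame differences
    return float(np.sum(np.max(np.abs(P), axis=0)))
```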
For efficiency and compactness, we follow the idea used in [37] and employ only a single slack variable across all the constraints in optimization problem (14), namely the ''1-slack'' formulation. This strategy requires less computation time than the traditional structured SVM [37], [38]. Thus, the optimization problem (14) for the optimal hash transformation matrix $W$ is reformulated as
$$\min_{W,\, \xi \geq 0} \;\; \lambda \|W\|_{2,1} + \gamma\, T(W) + \xi \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} \delta(X_i, Y_i^*, Y_i) \geq \frac{1}{N} \sum_{i=1}^{N} \Delta_{AP}(Y_i^*, Y_i) - \xi, \qquad (16)$$
where $Y_i \in \mathcal{Y}$ for $i = 1, 2, \cdots, N$ and $(Y_1, Y_2, \cdots, Y_N) \in \mathcal{Y}^N$ collects a batch of rankings corresponding to the $N$ videos in training set $\mathcal{X}$. It has been proved in [37] that the formulations (14) and (16) are equivalent.

IV. JOINT ACTIVE HASH LEARNING SCHEME
In this section, we follow the framework of pool-based active learning [39], [40] and select new videos only from the pool of unlabeled videos. The most important issue lies in how to measure the informativeness of the unlabeled videos. We develop an approach to actively select a set $\mathcal{X}_A = \{X_k : k = 1, 2, \cdots, L\}$ consisting of $L$ informative videos for the user to label, namely the active set of videos. With this active selection procedure, on one hand, we avoid labeling millions of videos in a large-scale dataset; on the other hand, the more informative videos can be utilized to learn better hash functions. In this paper, two criteria are considered to measure the informativeness of a video: the certainty of the unlabeled video and the estimation of average precision.

A. CRITERION RELATED TO CERTAINTY
Recall that in the hashing model of videos with structure information, the binary hash code of a video frame $x_t^i$ is given by $h_k(x_t^i) = \mathrm{sign}(w_k^\top x_t^i)$, so each hyperplane $w_k$ corresponds to one bit of the hash code. The $k$-th hash bit of video frame $x_t^i$ is $1$, i.e., $h_k(x_t^i) = 1$, if $x_t^i$ sits on the positive side of the decision hyperplane $w_k$, and $h_k(x_t^i) = -1$ otherwise. As a result, the certainty of a video $X$ is defined on the basis of its structure information by
$$C(X) = \big\| W^\top X \big\|_F, \qquad (17)$$
where $\|\cdot\|_F$ represents the Frobenius norm of a matrix. Intuitively, the criterion formulated in Eq. (17) means that the smaller the distance of a video $X$ to the hyperplanes collected in $W$, the more uncertain the video is. Consider the extreme case where all the frames of a video lie exactly on the hyperplanes: it is then completely uncertain whether the corresponding bits should be $1$ or $-1$.
In other words, the smaller the value of certainty, the more informative the video. According to this criterion, the total certainty of the $L$ selected videos is formulated as $\sum_{X \in \mathcal{X}_A} \|W^\top X\|_F$, and therefore we can select the set of most informative videos by solving the following optimization problem:
$$\min_{\mathcal{X}_A} \; \sum_{X \in \mathcal{X}_A} \big\| W^\top X \big\|_F. \qquad (18)$$
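The certainty criterion is a one-line computation (a NumPy sketch; the function name is hypothetical):

```python
import numpy as np

def certainty(X, W):
    """Certainty of a video X (d x n frame matrix): Frobenius norm of
    the projections of all frames onto the K hashing hyperplanes.
    Small values mean the frames sit near the hyperplanes, i.e. the
    bits are uncertain."""
    return float(np.linalg.norm(W.T @ X, 'fro'))
```

For instance, a frame lying exactly on every hyperplane contributes zero to the certainty, matching the extreme case discussed above.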

B. CRITERION RELATED TO AP
We design another criterion on the basis of error reduction for active learning [40], such that once the new video set $\mathcal{X}_A$ is selected, it minimizes the generalization error of the new hash functions. Recall that in the hashing model of videos with structure information, we learn $K$ hash functions by optimizing the average precision over training videos with weak rankings (relevant and non-relevant). Therefore, the winner videos in the active set $\mathcal{X}_A$ should increase the average precision performance. Combining the similarity measurement of videos with structure information, we adopt the precision-oriented strategy of [39], [41] for the active selection of unlabeled videos; this strategy has shown significant effectiveness in the MAP performance for image retrieval [39]. We compute the similarity between an unlabeled video $X$ and each labeled video $X_i$ in the labeled subset based on the cosine similarity $S(X, X_i)$ and then rank the labeled videos according to these similarities, denoted by $\hat{Y}$. Let the ground truth ranking of these labeled videos be $\hat{Y}^*$. It then becomes feasible to compute the average precision $AP(\hat{Y}^*, \hat{Y})$ and the AP loss
$$A_2(X) = 1 - AP(\hat{Y}^*, \hat{Y}). \qquad (19)$$
This AP-based criterion measures the informativeness of an unlabeled video intuitively: the bigger the value of $A_2(X)$, the more opportunity there is to increase the average precision once video $X$ is labeled. Based on the criterion formulated in Eq. (19), the total AP loss of the selected videos is summarized as $\sum_{X \in \mathcal{X}_A} A_2(X)$. Therefore, we select the active set of videos by solving the following optimization problem:
$$\min_{\mathcal{X}_A} \; -\sum_{X \in \mathcal{X}_A} A_2(X). \qquad (20)$$
Combining the criteria related to certainty and AP, our final strategy for active video selection is achieved via a trade-off between selecting the closest videos and optimizing the average precision, i.e., joint active selection by solving the following optimization problem:
$$\min_{\mathcal{X}_A} \; \sum_{X \in \mathcal{X}_A} \Big( \big\| W^\top X \big\|_F - \rho\, A_2(X) \Big), \qquad (21)$$
where $\rho > 0$ is a trade-off parameter. This optimization problem can be solved greedily due to the finite number of videos in the pool. Since we select the active videos according to both the certainty and the average precision loss, if a video $X \in \mathcal{X}_A$ is selected first and labeled as positive, for example, it is the closest to the positive labels and will be ranked well without bringing irrelevant videos to the top of the ranking. Finally, we request the user to label all the active videos selected in $\mathcal{X}_A$ and update the labeled dataset $\mathcal{X}$ to retrain the supervised video hashing model with structure information. The alternating process of learning video hash functions with structure information and actively selecting unlabeled videos can be repeated for several iterations. We summarize the joint active video hashing learning with structure information in Algorithm 1.

Algorithm 1 Active Video Hashing With Structure Information
Input: Training dataset $\mathcal{Z}$; initial labeled dataset $\mathcal{X}$ and the corresponding ground truth rankings; trade-off parameters $\lambda, \rho > 0$; accuracy threshold $\epsilon > 0$.
Output: Hashing projection matrix $W$.
1: Supervised video hashing over labeled dataset $\mathcal{X}$:
2: Solve the optimization problem (14) for $W$.
3: Joint active selection:
4: Solve the optimization problem (21) for $\mathcal{X}_A$.
5: Label the videos in $\mathcal{X}_A$ and update $\mathcal{X} = \mathcal{X} \cup \mathcal{X}_A$.
6: Repeat steps 1 to 5 for several iterations.
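The greedy joint selection over a finite pool can be sketched as follows (a NumPy illustration of the certainty-versus-AP trade-off; the function name, the score combination as certainty minus $\rho$ times AP loss, and the precomputed `ap_loss` inputs are our assumptions):

```python
import numpy as np

def select_active_set(pool, W, ap_loss, L, rho):
    """Greedy joint active selection: score each unlabeled video by its
    certainty ||W^T X||_F minus rho times its estimated AP loss, then
    pick the L videos with the smallest scores (most uncertain and with
    the largest expected AP gain).

    pool    : list of (d, n_i) frame matrices, one per unlabeled video
    ap_loss : list of precomputed AP-loss values, one per video
    Returns the indices of the selected videos."""
    scores = [float(np.linalg.norm(W.T @ X, 'fro')) - rho * a
              for X, a in zip(pool, ap_loss)]
    order = np.argsort(scores)            # ascending: most informative first
    return [int(i) for i in order[:L]]
```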

C. OPTIMIZATION PROCEDURE
In this section, we focus on solving the proposed challenging optimization problem (16) with a cutting-plane algorithm. This method alternately updates the hash transformation matrix $W$ and the most violated batch of constraints. The details are described in Algorithm 2, where the key issue lies in solving the following optimization problem:
$$\min_{W,\, \xi \geq 0} \;\; R(W) + \xi \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} \delta(X_i, Y_i^*, Y_i) \geq \frac{1}{N} \sum_{i=1}^{N} \Delta_{AP}(Y_i^*, Y_i) - \xi, \;\; \forall\, (Y_1, \cdots, Y_N) \in \mathcal{C}, \qquad (22)$$
where $R(W) = \lambda \|W\|_{2,1} + \gamma\, T(W)$ and the finite set $\mathcal{C}$ collects the most violated batches of constraints. Following the Pegasos algorithm proposed in [42] for the traditional SVM, we solve optimization problem (22) by alternating between stochastic sub-gradient descent and projection steps, where the sub-gradient descent is performed iteratively by picking the most violated batch of rankings at each iteration to minimize the slack variable.
Considering the non-smoothness of the $\ell_{2,1}$-norm and the $\ell_\infty$-norm used in $R(W)$, we follow Nesterov's smooth approximation [43] and approximate the term $R(W)$ by
$$R_\mu(W) = \lambda \sum_{i=1}^{d} f_{i,\mu}(W) + \gamma \sum_{j=1}^{T} g_{j,\mu}(W),$$
where the approximation functions $f_{i,\mu}(W)$ and $g_{j,\mu}(W)$ are differentiable and their gradients are Lipschitz continuous with constant $1/\mu$. They are defined respectively as
$$f_{i,\mu}(W) = \max_{\|v\|_2 \leq 1} \; \langle W_{i,:}^\top, v \rangle - \frac{\mu}{2} \|v\|_2^2, \qquad (23)$$
$$g_{j,\mu}(W) = \max_{\|u\|_1 \leq 1} \; \big\langle (W^\top \bar{X})_{:,j}, u \big\rangle - \frac{\mu}{2} \|u\|_2^2, \qquad (24)$$
for $i = 1, 2, \cdots, d$ and $j = 1, 2, \cdots, T$, where $\mu > 0$ is a smoothness parameter controlling the accuracy of the approximation, $W_{i,:}$ denotes the $i$-th row of matrix $W$, and the vectors $v, u \in \mathbb{R}^K$ are the auxiliary variables of $W_{i,:} \in \mathbb{R}^{1 \times K}$ and $(W^\top \bar{X})_{:,j} \in \mathbb{R}^K$, respectively. As a result, the update for $W$ at each iteration $t$ of the Pegasos algorithm is a sub-gradient step
$$W^{(t+1)} = W^{(t)} - \eta_t \big( \nabla R_\mu(W^{(t)}) + \partial_W \xi(W^{(t)}) \big), \qquad (25)$$
where $\eta_t$ is the step size. Subsequently, we derive the gradient of the regularizer $R_\mu$. When the variable $W$ is fixed, the unique maximizer of optimization problem (23) is $v(W_{i,:}) = W_{i,:}^\top / \max\{\mu, \|W_{i,:}\|_2\}$, and stacking these maximizers row-wise gives the gradient of the smoothed $\ell_{2,1}$ term:
$$V(W) = \big[ v(W_{1,:}),\, v(W_{2,:}),\, \cdots,\, v(W_{d,:}) \big]^\top \in \mathbb{R}^{d \times K}. \qquad (26)$$
On the other hand, the unique maximizer of optimization problem (24) is calculated using the $\ell_1$-ball projection algorithm in [44], i.e.,
$$u\big((W^\top \bar{X})_{:,j}\big) = \arg\min_{\|u\|_1 \leq 1} \Big\| u - \frac{1}{\mu} (W^\top \bar{X})_{:,j} \Big\|_2, \qquad (27)$$
for $j = 1, 2, \cdots, T$. The gradient of $g_{j,\mu}(W)$ is then derived as
$$\nabla g_{j,\mu}(W) = \bar{X}_{:,j}\; u\big((W^\top \bar{X})_{:,j}\big)^\top, \qquad (28)$$
where $\bar{X}_{:,j} \in \mathbb{R}^d$ is the $j$-th column of matrix $\bar{X}$ ($j = 1, 2, \cdots, T$). According to the analysis above, the gradient of $R_\mu(W)$ with respect to $W$ can be calculated by
$$\nabla R_\mu(W) = \lambda\, V(W) + \gamma \sum_{j=1}^{T} \bar{X}_{:,j}\; u\big((W^\top \bar{X})_{:,j}\big)^\top. \qquad (29)$$

Algorithm 2 Video Hashing With Structure Information
Input: Video training set $\mathcal{X}$; rankings $Y_1^*, Y_2^*, \cdots, Y_N^*$; trade-off parameters $\lambda, \gamma > 0$; accuracy threshold $\epsilon > 0$.
Output: Transformation matrix $W$ and slack variable $\xi \geq 0$.
1: Initialize the constraint set $\mathcal{C} \leftarrow \emptyset$.
2: repeat
3: Solve problem (22) over the current constraint set $\mathcal{C}$ for the optimal $W \in \mathbb{R}^{d \times K}$ and slack $\xi \geq 0$.
4: Find the most violated batch of rankings and add it to $\mathcal{C}$.
5: until the constraint violation falls below the accuracy threshold $\epsilon$.
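The key computational primitive in the smoothed gradient is the Euclidean projection onto the $\ell_1$-ball. A sketch of the standard sort-based projection (in NumPy; the function name is our own, and this is one common realization of the projection cited from [44], not necessarily the exact variant used by the authors):

```python
import numpy as np

def project_l1_ball(z, radius=1.0):
    """Euclidean projection of z onto the l1-ball {u : ||u||_1 <= radius},
    via the sort-and-threshold algorithm: find the soft-threshold theta
    such that sum(max(|z| - theta, 0)) == radius."""
    if np.abs(z).sum() <= radius:
        return z.copy()                       # already inside the ball
    s = np.sort(np.abs(z))[::-1]              # magnitudes, descending
    cs = np.cumsum(s)
    k = np.arange(1, len(z) + 1)
    rho = np.nonzero(s - (cs - radius) / k > 0)[0][-1]
    theta = (cs[rho] - radius) / (rho + 1)
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)
```

With this primitive, the maximizer in Eq. (27) is obtained by projecting $(W^\top \bar{X})_{:,j} / \mu$ onto the unit $\ell_1$-ball.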

V. EXPERIMENTS
In this section, we conduct an extensive set of experiments to demonstrate the superiority of the proposed approach on four widely used benchmark datasets, including the Columbia Consumer Video (CCV) dataset, TRECVID MEDTest 2014, MEDTest 2013 datasets, and UCF 101 dataset.

2) FEATURE EXTRACTION
For each video clip in the datasets, we sample one frame every 2 seconds and extract the frame-level CNN descriptors using the architecture of [47]. The key insight in [47] is that by using smaller convolution filters (3 × 3) and a very deep architecture (16-19 layers), significant improvement on the ImageNet Challenge 2014 can be achieved. Due to its excellent performance on images, we choose to apply the same architecture to the video datasets.

3) COMPETITORS
To validate the performance of the proposed algorithm, we compare with the following alternatives: (1) Spectral Hashing (SH) [25]: we directly use random sampling + spectral hashing as a baseline; based on the assumption that data are sampled from a uniform distribution, the data are partitioned along their principal directions with consideration of spatial frequencies. (2) Binary Reconstructive Embedding (BRE) [48]: we first select the most informative samples using the state-of-the-art active learning methods in [9], [49], and then the hash functions are learnt based on binary reconstructive embedding. (3) The method of [50]: similarly, we first select the most informative samples using the state-of-the-art active learning method in [9]; this method trades off supervised information against data properties to design robust hash functions, aiming to alleviate the defects of overfitting or insufficient training data. (4) Active Hashing (AH) [51]: active hashing actively selects the most informative labeled pairs for hash function learning; specifically, it identifies the most informative points to label and constructs labeled pairs accordingly. (5) Video Hashing with both Discriminative commonality and Temporal consistency (VHDT) [18]: we first select the most informative samples using the state-of-the-art active learning method in [9]; the video hashing is formulated as a minimization problem over a structure-regularized empirical loss, where the structure regularization exploits the common local visual patterns and simultaneously preserves the temporal consistency.

4) EVALUATION PROTOCOLS
To make a fair comparison, we use Hamming ranking as the evaluation criterion, which is widely used in the literature. All the video clips in the dataset are ranked according to their Hamming distance from the query video clip, and the desired neighbors are returned from the top of the ranked list. For quantitative comparison, we adopt Average Precision (AP) as the evaluation metric for the Hamming ranking criterion: we calculate the result for each query and report the mean across all queries as the final evaluation metric. The proposed method is implemented in Matlab on a PC with an Intel Core i5-2400 CPU at 3.1 GHz and 8 GB RAM. We tuned all the parameters using 5-fold cross validation and selected the best ones: the trade-off parameter λ in the range {0.01, 0.1, 1, 10, 100} and ρ in the range {0.3, 0.4, 0.5, 0.6, 0.7}.
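The Hamming-ranking mAP protocol can be sketched as follows (a NumPy illustration of the evaluation described above; the function name and the row-per-item code layout are our assumptions):

```python
import numpy as np

def mean_average_precision(code_db, code_queries, labels_db, labels_q):
    """Hamming-ranking mAP: rank the database codes by Hamming distance
    to each query code, compute AP per query, and average over queries.
    Codes are {-1, +1} matrices with one item per row."""
    aps = []
    for q, yq in zip(code_queries, labels_q):
        dist = np.sum(code_db != q, axis=1)        # Hamming distances
        order = np.argsort(dist, kind='stable')    # ascending distance
        rel = (labels_db[order] == yq)             # relevance of ranked list
        hits = np.cumsum(rel)
        ranks = np.arange(1, len(rel) + 1)
        if hits[-1] == 0:
            aps.append(0.0)
            continue
        aps.append(float(np.sum(hits[rel] / ranks[rel]) / hits[-1]))
    return float(np.mean(aps))
```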

B. PERFORMANCE COMPARISON
In the experiments, we test the performance using hash codes of different lengths (from 12 to 64 bits). We report performance in terms of mean Average Precision (mAP) for the different methods in Table 1 and Table 2. From the experimental results, we make the following observations: -The proposed AHSI method consistently performs much better than the other alternatives, which confirms its effectiveness for large-scale video hashing problems. -VHDT outperforms the remaining alternatives by a large margin; this is because VHDT takes advantage of the structure information of video data while generating hash codes.
-Both AH and AHSI perform better than the other baseline methods, which demonstrates the importance of selecting informative samples to label.

C. SENSITIVITY ANALYSIS
In contrast to other active video hashing algorithms, the proposed method requires only two parameters, λ and ρ, to be tuned. In the previous experiments, we estimated the best parameters using five-fold cross validation.
To demonstrate the parameters' influence on the performance, we conduct experiments to analyze their sensitivity in terms of mAP. Experimental results on the four datasets are reported in Figure 1. The trade-off parameter λ is tuned in the range {0.01, 0.1, 1, 10, 100} and ρ in the range {0.3, 0.4, 0.5, 0.6, 0.7}; the range of ρ is decided empirically. From the experimental results, we observe that the performance changes differently with respect to the parameters on different datasets: the optimal parameter values are data dependent. This confirms the importance of using cross-validation to choose the best parameters for each dataset.

VI. CONCLUSION
In this work, we have introduced a novel video hash coding method to optimize the mean average precision directly. The proposed hashing method is built on the framework of active learning. It explicitly exploits the structure information of video clips to learn the optimal hash functions, where the structure information consists of the temporal consistency between successive frames as well as the local visual patterns shared by videos with the same semantic labels. Instead of using the similarity between each pair of videos, a ranking-based loss function is utilized, which directly maximizes the mean average precision. Extensive experiments on four large-scale video datasets have confirmed the superiority of the proposed method.