Multi-Sensor Integration for Key-Frame Extraction From First-Person Videos

Key-frame extraction for first-person vision (FPV) videos is a core technology for selecting important scenes and memorizing impressive life experiences in our daily activities. The difficulty of selecting key frames is the scene instability caused by head-mounted cameras used for capturing FPV videos. Because head-mounted cameras tend to frequently shake, the frames in an FPV video are noisier than those in a third-person vision (TPV) video. However, most existing algorithms for key-frame extraction mainly focus on handling the stable scenes in TPV videos. The technical development of key-frame extraction techniques for noisy FPV videos is currently immature. Moreover, most key-frame extraction algorithms mainly use visual information from FPV videos, even though our visual experience in daily activities is associated with human motions. To incorporate the features of dynamically changing scenes in FPV videos into our methods, integrating motions with visual scenes is essential. In this paper, we propose a novel key-frame extraction method for FPV videos that uses multi-modal sensor signals to reduce noise and detect salient activities via projecting multi-modal sensor signals onto a common space by canonical correlation analysis (CCA). We show that the two proposed multi-sensor integration models for key-frame extraction (a sparse-based model and a graph-based model) work well on the common space. The experimental results obtained using various datasets suggest that the proposed key-frame extraction techniques improve the precision of extraction and the coverage of entire video sequences.


I. INTRODUCTION
First-person vision (FPV) videos captured by head-mounted wearable cameras are useful for understanding daily life activities [1], [2]. FPV videos often contain important scenes worth remembering in our daily lives. Summarizing such salient scenes is essential because FPV videos tend to be redundant [2]. However, FPV videos are unstable and noisy compared to third-person vision (TPV) videos, and most existing methods of video summarization mainly focus on handling the stable scenes in a TPV video [3]-[19]. Moreover, the following differences between FPV and TPV videos make summarizing FPV videos substantially more difficult than summarizing TPV videos. (i) Camera placement: FPV videos are captured from the wearer's viewpoint (e.g., the chest or head), whereas TPV videos are captured from a fixed point of view. (ii) Intention: Although TPV videos are intentionally recorded by the photographer, FPV videos are recorded regardless of the wearer's intention. Such unconstrained FPV videos often contain insignificant objects, such as a ceiling or a floor. (iii) Content: TPV videos record experiences worth remembering through a manual operation that focuses on specific interesting scenes. FPV videos record natural scenes of life, which may contain repetitive video shots that are irrelevant to our interests. (iv) Quality: Whereas TPV videos contain stable frames, FPV videos tend to contain blurry and shaky frames due to the wearer's body motion. Therefore, TPV summarization techniques applied to noisy FPV videos perform inaccurately and can be even worse than uniform sampling [2]. To obtain high-quality FPV video summaries, we must address these issues.
The associate editor coordinating the review of this manuscript and approving it for publication was Cesar Vargas-Rosales.
In this paper, we present a key-frame extraction method for FPV videos with multi-sensor signals. To reliably select key frames, our method uses motion signals as extra sensor information beyond video frames, whereas most existing methods use only video information [3]-[19]. We assume that motion information expresses the detailed hand or head movement that visual information does not capture. To associate the features of the two modalities, we embed the multi-sensor data into a common vector space [20]-[27] using probabilistic canonical correlation analysis (PCCA) [28]. The projection matrices learned by PCCA ensure that the relevant pairs of information are close. Moreover, we propose two key-frame extraction algorithms that are performed on this learned space. First, we use a sparse key-frame selection method based on a sparse measure, the $\ell_1$-norm, and extend it with multi-sensor integration. Second, we use a key-frame extraction approach based on a probabilistic graphical model (referred to as a graph model) employing conditional random fields (CRFs) [29] for multi-sensor integration. We show that the proposed multi-sensor integration is effective for key-frame extraction from FPV videos under both sparse-based and graph-based models.
This paper is an extension of our conference publications [30], [31] with significant modifications. The two major differences are as follows: 1) We introduce a graph-model-based method in addition to a sparse-model-based method to extract key frames from FPV videos, which shows that the proposed multi-sensor integration can improve the key-frame extraction performance across different methods. 2) We expand the experimental results not only by adding more videos to the dataset used in the conference papers but also by introducing a new dataset and quantitative comparisons with the existing methods.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORKS
Key frames are a group of frame images selected from different scenes in a video and presented in temporal order [4], [32]. Although there are several mathematical definitions of key frames, these definitions commonly model key frames as the most representative and informative frames, reflecting the most important contents in a video [4]- [6], [14]- [16].

1) KEY-FRAME EXTRACTION
Many key-frame extraction methods have been proposed in the literature [3]-[16]. Liu et al. [3] presented an algorithm based on the maximum a posteriori (MAP) method to detect key frames. Ejaz et al. [4] developed an integration scheme that combines the image features obtained from the correlation of RGB colour features, a colour histogram, and moments of inertia to select the key frames. Elhamifar et al. [6] proposed sparse modelling representative selection (SMRS), which is an efficient algorithm for video classification and summarization. SMRS employs a sparse-coding-based framework with the $\ell_1$-norm as a sparsity constraint. However, applying SMRS directly to FPV videos tends to select uninformative frames because of the noise and instability in FPV videos [30], [31]. The proposed technique extends SMRS for better key-frame selection from FPV videos.

2) SPARSE REPRESENTATION
Sparse representation is a common model of sparse signals [33]. Many applications, such as compressive sensing [34], denoising, sampling, classification, super-resolution, inpainting, and deblurring, employ the theory and models of sparse representation as their foundation. In the literature, sparse representation has been shown to be an extremely powerful tool for representing, analysing, and compressing signals [33], [35]. Aiming for sparsity, most sparse representations employ the $\ell_0$-norm [33], [35], [36] or the $\ell_1$-norm [37], [38] as the sparsity constraint. In this paper, we also use the $\ell_1$-norm as a sparse measure to reduce noise and detect salient activities in first-person videos.

3) GRAPH MODEL
A graph model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. Graph models are widely used in pattern recognition and machine learning [39]. Two commonly used branches of graph models are Markov random fields and Bayesian networks. An increasing number of publications in computer vision use graph-based energy minimization techniques for image processing applications, such as segmentation [40], [41], image restoration [42], stereo [43], [44], shape reconstruction [45], object recognition [46], texture synthesis [47], and socialized group photography [48]. For example, Ngo et al. [49] proposed video summarization methods and scene detection algorithms based on graph modelling. In their methods, a video is expressed as a complete undirected graph, and the normalized cut algorithm is applied to globally and optimally divide the graph into video clusters. Molino et al. [50] used a probabilistic approach based on active inference in CRFs, which are a type of discriminative undirected probabilistic graphical model, for active video summarization. In contrast to the existing graph-based approaches that use only video information, our approach uses sensor information in addition to video information to accurately model daily living activities.

III. MODEL AND FORMULATION
In this work, we propose a multi-sensor integration-based key-frame extraction method for FPV videos. First, we focus on applying the sparse model to select key frames using multi-sensor integration. Second, we use the graph model to select the key frames from FPV videos. Our proposed multi-sensor integration method can be applied to any key-frame extraction algorithm. However, in this paper, we choose two examples: sparse-model-based and graph-model-based algorithms.

A. VIDEO FEATURE SELECTION
First, we extract all the video frames from the raw video and obtain deep semantic features by adopting a pre-trained DNN (e.g., VGG). Inspired by previous work [51], we represent the input video frames as deep features in a semantic space, where every feature corresponds to one frame. This representation encodes the semantic transition of videos and is thus effective for many video processing applications, such as video description, video generation, and video retrieval. We assume that a video contains several clusters, each of which consists of similar frames. With this assumption, we estimate the clusters of frames by solving an optimization formulation of the video representation. We use deep frame features learned by the DNN rather than raw video frame images. For feature extraction, we employ a pre-trained VGG network [52], which produces discriminative visual features. Note that the video features in the following sections of this paper refer to the features extracted from VGG. In this way, we convert each video frame into a 1000-dimensional vector. After extracting the VGG features from all $N$ frames, we construct a dictionary matrix of size $1000 \times N$, where each column contains the 1000-dimensional feature vector of one frame.
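The construction of the dictionary matrix can be illustrated with a minimal sketch. It assumes the per-frame feature vectors have already been extracted (the network itself is not run here); the function names `build_dictionary` and `normalize_columns` are hypothetical, and the unit-norm scaling mirrors the column normalization used as the first step of the key-frame extraction algorithms.

```python
import math

def build_dictionary(frame_features):
    """Stack per-frame feature vectors as the columns of a C x N matrix (list of rows)."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frame features must have the same length")
    # Row c holds feature c for every frame; column n is the feature vector of frame n.
    return [[frame_features[n][c] for n in range(n_frames)] for c in range(dim)]

def normalize_columns(matrix):
    """Scale every column (frame) to unit l2-norm; all-zero columns stay zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    return [[matrix[r][c] / norms[c] if norms[c] > 0 else 0.0
             for c in range(n_cols)]
            for r in range(n_rows)]

# Toy example: 3 frames with 2-dimensional features
# (an actual VGG feature vector would have 1000 entries).
features = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
X = normalize_columns(build_dictionary(features))
```

In this toy case the matrix has 2 rows and 3 columns, and each column has unit length after normalization.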

B. PROJECTION WITH MULTI-SENSOR INTEGRATION
We employ PCCA to embed the multi-sensor data (video and motion) into a common space (Fig. 1) [30]. Let $X_v \in \mathbb{R}^{C \times N}$ represent the video data and $X_s \in \mathbb{R}^{C' \times N}$ represent the motion sensor data, where $N$ is the number of frames, $C$ is the video feature dimensionality, and $C'$ is the motion sensor feature dimensionality. In this paper, the sensor features consist of three-axis acceleration (the rate of change in the velocity of an object), obtained by three-axis accelerometers, and rotational changes or orientation, obtained by gyroscopes. Motion data are generally not indexed by frames; more often, they are indexed by seconds or another time unit. To fuse the motion data with the video data, we first synchronize the two modalities and then resample the motion data in units of video frames. The linear projections $A_v$ and $A_s$ from the video and sensor domains to a common space are obtained from a CCA formulation expressed with the Frobenius norm $\|\cdot\|_F$ [53], where $A_v \in \mathbb{R}^{\tilde{C} \times C}$ and $A_s \in \mathbb{R}^{\tilde{C} \times C'}$ are linear projectors from the video feature domain and the sensor domain, respectively, to the common space of dimension $\tilde{C}$. The optimal projection matrices $A_v$ and $A_s$ are estimated from the solutions of an eigenvalue problem. After learning the projection matrices $A_v$ and $A_s$, we can use them to project data vectors from the video and sensor domains into the $\tilde{C}$-dimensional common space, where the corresponding pairs of information are close to each other [30]. Because $\tilde{C} < C$ and $\tilde{C} < C'$, PCCA also reduces the dimension of the data in the common space. The common-space video features $X_v^c = A_v X_v$ and the common-space sensor features $X_s^c = A_s X_s$ are concatenated into the integrated feature matrix $X = [X_v^c, X_s^c]$.
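The step of bringing motion data to the units of video frames can be sketched as follows. This is a minimal illustration, assuming that the two modalities are already synchronized to a shared clock and that linear interpolation is an acceptable resampling rule; the function name `resample_to_frames` and the timestamps are hypothetical and are not taken from the datasets.

```python
def resample_to_frames(sensor_times, sensor_values, frame_times):
    """Linearly interpolate one sensor channel at each video frame timestamp.

    Assumes sensor_times is strictly increasing and frame_times is sorted.
    """
    if len(sensor_times) != len(sensor_values) or len(sensor_times) < 2:
        raise ValueError("need at least two sensor samples with matching lengths")
    resampled = []
    j = 0  # index of the sensor interval currently being used
    for t in frame_times:
        # Frames outside the sensor recording take the nearest available sample.
        if t <= sensor_times[0]:
            resampled.append(sensor_values[0])
            continue
        if t >= sensor_times[-1]:
            resampled.append(sensor_values[-1])
            continue
        # Advance to the interval [sensor_times[j], sensor_times[j + 1]] containing t.
        while sensor_times[j + 1] < t:
            j += 1
        t0, t1 = sensor_times[j], sensor_times[j + 1]
        w = (t - t0) / (t1 - t0)
        resampled.append((1 - w) * sensor_values[j] + w * sensor_values[j + 1])
    return resampled

# One accelerometer axis sampled every 0.5 s, video frames every 0.25 s.
acc_x = resample_to_frames([0.0, 0.5, 1.0], [0.0, 10.0, 20.0],
                           [0.0, 0.25, 0.5, 0.75, 1.0])
```

Applying the same function to each accelerometer and gyroscope channel gives one sensor feature vector per video frame, so the sensor matrix has the same number of columns as the video feature matrix.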

C. SPARSE-MODEL-BASED KEY-FRAME EXTRACTION
First, we propose our multi-sensor integration model for extracting key frames based on a sparse model. Fig. 2 shows the framework of our approach for video key-frame extraction based on the sparse model. A signal model formulates a mathematical description of a group of signals, which allows them to be distinguished from the remaining signal space. Linear representation models have recently received appreciable attention [54], [55]. In such models, signals are expressed as linear combinations of representative signals, and finding the representative signals can be formulated as a sparse multiple measurement matrix problem [6]. Among linear representation algorithms, sparse modelling [33], [56]-[58] is a particularly effective methodology. The aim of sparse modelling is to approximate a natural signal by a linear combination of a few dictionary atoms. Fig. 3 shows the proposed framework of the video summarization.
To incorporate sparse representation into key-frame extraction, we consider a modification of the dictionary learning problem, whose optimization suffers from local minima because two unknown matrices, namely, the sparse coefficient matrix and the dictionary matrix, must be estimated. This modification enforces learning sparse representations from natural signals [6]. For this purpose, following SMRS [6], the formulation of sparse-representation-based key-frame extraction can be written as follows:
$$\min_{H} \|X - XH\|_F^2 \quad \text{s.t.} \quad \|H\|_{0,q} \le k,$$
where $k$ bounds the number of selected representatives. Here, the $\ell_{0,q}$ norm is expressed as:
$$\|H\|_{0,q} = \sum_{i=1}^{N} I\left(\|h_i\|_q > 0\right).$$
Here, $h_i$ is the $i$th row of the matrix $H$, and $I$ is the indicator function. Generally, $q = 2$ corresponds to the $\ell_2$ norm. $\|H\|_{0,q}$ counts the nonzero rows of the sparse coefficient matrix $H$. The indices of the nonzero rows of $H$ correspond to the indices of the columns of $X$ that are selected as the signal representation. The indices of the zero rows of $H$ correspond to redundant frames, which are neighbours of key frames. We select the nonzero rows to represent key frames and discard the redundant and irrelevant frames. It is preferable that the extraction of the representation is invariant with respect to a global translation of the signal. Thus, we enforce the affine constraint $\mathbf{1}^\top H = \mathbf{1}^\top$. Because the $\ell_0$-norm problem is NP-hard, we introduce the $\ell_1$ norm as a relaxation of this NP-hard problem. The $\ell_1$ norm is the sum of the absolute values of the elements of a vector. The proposed objective formulation can be written as:
$$\min_{H} \|X - XH\|_F^2 \quad \text{s.t.} \quad \|H\|_{1,q} \le \tau, \quad \mathbf{1}^\top H = \mathbf{1}^\top,$$
where $\tau > 0$ bounds the row sparsity. Here, $\|H\|_{1,q}$ is expressed as:
$$\|H\|_{1,q} = \sum_{i=1}^{N} \|h_i\|_q.$$
To measure the rows of $H$ with the $\ell_2$ norm, we take $q = 2$. The final objective formulation can be written in regularized form as:
$$\min_{H} \; \lambda_s \|H\|_{1,2} + \frac{1}{2}\|X - XH\|_F^2 \quad \text{s.t.} \quad \mathbf{1}^\top H = \mathbf{1}^\top,$$
where $\lambda_s > 0$ is a regularization weight set through the regularization parameter $\alpha$, and
$$\|H\|_{1,2} = \sum_{i=1}^{N} \sqrt{\sum_{j=1}^{N} h_{ij}^2}.$$

D. FACTOR-GRAPH-BASED KEY-FRAME EXTRACTION
Second, we propose our multi-sensor integration model for extracting key frames based on graph models. The frameworks using only video information and multi-sensor integration are shown in Fig. 2 and Fig. 3.
Algorithms that deal with complicated global functions of many variables often exploit the way in which the given function factors as a product of ''local'' functions, each of which depends on a subset of the variables. This factorization can be expressed by a graph structure called a factor graph [59]. Let $s \in \{0, 1\}^N$ be a binary vector that represents the summary of frames from the FPV video, where $s_i$ is equal to 1 when the $i$th frame is selected as a key frame and 0 when the $i$th frame is not selected as a key frame. $p(s|\theta)$ denotes the probability distribution of how likely the frames selected in $s$ are to be key frames. We select the frames with $s_i = 1$ and omit the frames with $s_i = 0$ to discard the redundant and irrelevant frames. We model this distribution by a CRF, where $\theta = [\theta_0, \theta_1, \alpha, \gamma, \beta]$ are its parameters, which are defined later in this subsection.
A CRF models the probability density with a Gibbs distribution [50], [60]. Thus, $p(s|\theta)$ can be expressed as the normalized exponential of an energy function, which is denoted as $E_\theta(s)$: $p(s|\theta) \propto \exp\{-E_\theta(s)\}$. The summary of the key frames, denoted as $s^*$, is generated by solving the MAP problem as follows:
$$s^* = \arg\max_{s} p(s|\theta) = \arg\min_{s} E_\theta(s).$$
We define the energy function as follows:
$$E_\theta(s) = \sum_{i} U_\theta(s_i) + \lambda \sum_{(i,j)} P_\theta(s_i, s_j).$$
Here, the unary potential $U_\theta(s_i)$ enforces the selection of static frames, the pairwise potential $P_\theta(s_i, s_j)$ encourages frames with diverse semantic content, and $\lambda > 0$ is a parameter that weights the unary and pairwise potentials. Taking four frames $(s_1, s_2, s_3, s_4)$ as an example, we illustrate the unary and pairwise interactions as a graph in Fig. 4. A directed weighted graph includes a group of nodes and a group of directed edges that connect the nodes. Generally, the nodes represent pixels, frames, or other features. A graph normally contains two special nodes, referred to as terminals: the source $s$ and the sink $t$; thus, it is called an $s$-$t$ graph. In the context of vision, terminals correspond to the group of labels that can be assigned to pixels [61]. In our situation, we focus on the case of a graph with two terminals, namely, key frame and not a key frame, which is illustrated in Fig. 5.
The unary potential, $U_\theta(s_i)$, defines the baseline cost for a frame to be selected as a key frame. We model
$$U_\theta(s_i) = \theta_0 I[s_i = 0] + \theta_1 I[s_i = 1],$$
where $I[Q]$ is an indicator function with $I[Q] = 1$ if $Q$ is true and $I[Q] = 0$ otherwise, and $\theta_0$ and $\theta_1$ are constants that balance the ratio of key frames to other frames. The pairwise potential, $P_\theta(s_i, s_j)$, is defined between each pair of similar frames and enforces selecting frames with diverse contents. Let $d(\psi_i, \psi_j)$ be the Euclidean distance between the features $\psi_i$ and $\psi_j$ of two frames $i$ and $j$, expressed as follows:
$$d(\psi_i, \psi_j) = \|\psi_i - \psi_j\|_2.$$
The pairwise potential enforces that similar frames should not both be selected for the summary. For this purpose, we define a potential that is weighted by the distance between features through the factor $\exp\{-d(\psi_i, \psi_j)\}$. Here, $P_\theta(s_i, s_j)$ expresses that the frames $i$ and $j$ should not be selected at the same time, and the term $\exp\{-d(\psi_i, \psi_j)\}$ reduces the effect of $P_\theta(s_i, s_j)$ when the frames are dissimilar. The value of the potential, $P_\theta(s_i, s_j)$, is smaller when the frames $i$ and $j$ are dissimilar. Specifically, $P_\theta(s_i, s_j)$ is specified by its values for the four label combinations $(0, 0)$, $(0, 1)$, $(1, 0)$, and $(1, 1)$. Thus, the optimal solution for key-frame extraction can be obtained by minimizing the potentials as follows:
$$s^* = \arg\min_{s} \sum_{i} U_\theta(s_i) + \lambda \sum_{(i,j)} P_\theta(s_i, s_j). \tag{15}$$
We use a general optimization framework of trust-region-based local submodular approximations (LSA-TR) [62] to solve problem (15). The local submodular approximations (LSA) approach constructs an approximation model without additional variables and uses a more accurate approximation. Trust region (TR) methods are a class of iterative optimization algorithms. The model is only accurate within a small region around the current solution called the ''trust region'', and the approximate model is then globally optimized within the trust region to obtain a candidate solution.

Algorithm 1 Sparse-Model-Based Key-Frame Extraction (SMFE)
Require: Signal matrices $X_v$ from VGG and $X_s$ from sensors
1: Normalize the columns of the signals $X_v$ and $X_s$ to a unit $\ell_2$-norm.
2: Embed into a multi-information matrix by PCCA.
3: Set the regularization parameters.
4: Initialize $H$ as a random matrix.
5: Execute SMRS [6] with ADMM to estimate the indices of the key frames from the FPV video.

Algorithm 2 Graph-Model-Based Key-Frame Extraction (GMFE)
Require: Data matrices $X_v$ from VGG and $X_s$ from sensors
1: Normalize the columns of the data $X_v$ and $X_s$ to a unit $\ell_2$-norm.
2: Embed into a multi-information data matrix by PCCA.
3: Compute the $2 \times N$ array of unary terms ($N$ is the number of frames in the video).
4: Compute the $M \times 6$ array, which is a list of $M$ arbitrary pairwise potentials. Each row in this pairwise potential list is of the format $[i, j, P_\theta(0, 0), P_\theta(0, 1), P_\theta(1, 0), P_\theta(1, 1)]$, where $i$ and $j$ are neighbours and the four coefficients define the interaction potential.
5: Execute LSA-TR [62] to estimate the indices of the key frames of the FPV video.

IV. ALGORITHMS
A. SPARSE MODEL
This section describes the proposed algorithm for summarization from FPV videos with multi-sensor integration based on SMRS [6]. The coding matrix of SMRS is computed using data self-representativeness (the dictionary is set by the video signals themselves) adopting block sparsity regularization. We employ the alternating direction method of multipliers (ADMM) optimization scheme. The corresponding algorithm is described in Algorithm 1: Sparse-model-based key-frame extraction (SMFE). We used the existing implementation of SMRS 1 for our method.
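The role of block sparsity in selecting key frames can be illustrated with a minimal sketch. It is not the SMRS implementation; it only shows the mixed norm that sums the $\ell_2$-norms of the rows of the coefficient matrix, the row-wise shrinkage that is the proximal operator of that norm (the kind of update an ADMM-type solver applies to the sparsity term), and the reading of key-frame indices from the surviving rows. The function names and the threshold `tau` are hypothetical.

```python
import math

def l12_norm(H):
    """Sum of the l2-norms of the rows of H (the mixed l_{1,2} norm)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in H)

def shrink_rows(H, tau):
    """Row-wise soft-thresholding: the proximal operator of tau * ||H||_{1,2}."""
    out = []
    for row in H:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows with norm <= tau are set to zero; the others are shortened by tau.
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

def key_frame_indices(H, tol=1e-8):
    """Indices of rows whose norm exceeds tol, i.e., the frames kept as representatives."""
    return [i for i, row in enumerate(H)
            if math.sqrt(sum(v * v for v in row)) > tol]

# Toy coefficient matrix with one strong row, one weak row, and one zero row.
H = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
H_shrunk = shrink_rows(H, tau=1.0)
selected = key_frame_indices(H_shrunk)
```

In this toy example only the first row survives the shrinkage, so only frame 0 is reported as a representative; a larger threshold removes more rows and therefore yields fewer key frames.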

B. GRAPH MODEL
Next, we describe the proposed algorithm for summarizing an FPV video through multi-sensor integration based on a graph model. We employ a min-cut/max-flow optimization framework to optimize the corresponding objective function. The corresponding algorithm is described in Algorithm 2: Graph-model-based key-frame extraction (GMFE). We use the implementation of local submodular approximations-trust region (LSA-TR) by Gorelick et al. [62]. 2
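To make the energy minimization concrete, the following minimal sketch evaluates an energy with unary and pairwise terms and finds its minimizer by exhaustive search. Several parts are assumptions made only for illustration: the unary term charges `theta0` for an unselected frame and `theta1` for a selected one, the pairwise term charges $\exp\{-d\}$ only when both frames of a pair are selected, and the exhaustive search stands in for LSA-TR, which is needed for videos of realistic length. The parameter values and toy features are hypothetical.

```python
import itertools
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def energy(s, features, theta0, theta1, lam):
    """Sum of unary terms plus lam times the sum of pairwise terms (assumed forms)."""
    unary = sum(theta1 if si == 1 else theta0 for si in s)
    pairwise = 0.0
    for i, j in itertools.combinations(range(len(s)), 2):
        if s[i] == 1 and s[j] == 1:
            # Assumed penalty: close to 1 for near-identical frames, close to 0 for distant ones.
            pairwise += math.exp(-euclidean(features[i], features[j]))
    return unary + lam * pairwise

def map_summary(features, theta0, theta1, lam):
    """Exhaustive search over all 2^N labelings; feasible only for very small N."""
    n = len(features)
    return min(itertools.product((0, 1), repeat=n),
               key=lambda s: energy(s, features, theta0, theta1, lam))

# Four toy frames forming two groups of near-duplicates.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
best = map_summary(features, theta0=1.0, theta1=0.5, lam=2.0)
```

With these toy values, selecting a frame is cheaper than skipping it, but selecting two near-duplicates incurs a large pairwise cost, so the minimizer keeps one frame from each group.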

V. EXPERIMENTAL SETTINGS
We evaluate our proposed key-frame extraction methods using human activity datasets captured in a house. In the following section, we present these datasets in detail.

A. DATASETS
1) CMU-MMAC
The Carnegie Mellon University Multimodal Activity (CMU-MMAC) database [63] is designed to overcome some of the previous limitations by collecting multi-modal (e.g., video, audio, motion capture, and accelerations) signals of human activity. To collect human activity in an environment that is as natural as possible, researchers have installed a nearly fully operable kitchen and collected the preparation of some meals from the beginning to the end. A Firewire camera, FL2-08S2C, is worn on the head of the subject. Accelerometer and gyroscope information is collected with MicroStrain's 3DM-GX1 inertial measurement units.
There are five datasets that consist of cooking five different recipes in the CMU-MMAC database: brownie, salad, pizza, scrambled eggs, and sandwich. Because only the brownie dataset has labels, we use the brownie dataset in our paper. There are 13 videos in the brownie dataset, from B07 to B24.

2) DAILY ACTIVITIES
We used another non-public dataset collected by Miyanishi et al. [64], which we call the daily activities dataset in this paper. This dataset contains the daily activities of 8 persons (not the researchers), whose ages ranged from 21 to 26 years (mean = 23.13, SD = 1.69). These subjects wore wearable sensors comprising a wearable camera, three-axis accelerometers, and gyroscopes. The subjects executed 20 daily actions at various locations following written instructions on a worksheet without direct supervision from the experimenters. For instance, a subject ''washes dishes'' in the kitchen and ''drinks tea'' in the living room. For each person, there are several sessions containing different actions. The recorded sensor signals consist of 17 h of video and motion data covering approximately 20 actions. The proposed algorithm selects key frames from these FPV videos using not only video but also motion information [64].
The order of locations where subjects performed their daily activities is shuffled in each session. The dataset specifies a room layout of the experimental environments and lists the 20 daily activities performed by the subjects at each location in each session of the with-object task. A single session averaged 10.86 min (SD = 1.14) among the subjects. The sessions were repeated 12 times (including two initial practice sessions), and short breaks were allowed. There was no researcher to supervise the subjects while collecting data under the semi-natural collection protocol. The researchers used the motion and video data from the 3rd to 12th sessions of the with-object task as the search target. After the with-object task, to collect gesture motions for retrieving past activities, the subjects were asked to remember and repeat the 20 activities that they performed in the with-object task experiments as gesture motions, which are used for queries. The second experiment is called a without-object task. Its activities are slightly different from the with-object task activities and required completing each activity during specified times. For example, the activity is to ''pour hot water'' and ''stir a cup of coffee'' rather than to ''make coffee.'' The subjects then repeated the 20 activities; this time, there was no object, and they were in a new environment.
2 http://vision.csd.uwo.ca/code/

B. METRICS
To evaluate the key-frame extraction performances of different algorithms, we introduced two metrics: accuracy and entropy.
We use accuracy ($A$) to evaluate the effectiveness of the proposed methods for key-frame extraction from FPV videos with multi-sensor integration, which is defined as follows:
$$A = \frac{N_{\mathrm{Correct}}}{N_{\mathrm{Whole}}},$$
where $N_{\mathrm{Correct}}$ denotes the number of selected key frames that are correctly selected with respect to the labels and $N_{\mathrm{Whole}}$ is the total number of key frames selected by the methods. Note that the labels correspond to different actions in the video; each label has a start frame and an end frame. If a selected key frame lies between the start frame and the end frame of a label, we consider the key frame to be correctly chosen. However, the metric $A$ cannot fully measure the quality of a key-frame extraction. As shown in Fig. 6, cases (a) and (b) have the same accuracy value of 5/6, but the results of the key-frame extraction are different. In case (a), the correctly selected frames are all concentrated on event 4. In case (b), the selected key frames are dispersed (events 1, 3, 4, 5, and 6). Generally, the result of case (b) is better than that of case (a).
To evaluate the information content of different actions in the video, we introduced entropy as a metric for the experimental results, which is defined as follows:
$$\mathrm{Entropy} = -\sum_{i} p_i \log_2 p_i.$$
Here, $p_i$ is the probability of each event extracted by the proposed algorithm. This metric is maximized if all extracted events are equally likely. Thus, a higher entropy value means a better key-frame extraction result. In Fig. 6, the entropy of case (a) is 0, and the entropy of case (b) is 2.3219. Thus, the entropy of (b) is higher than that of (a), which means that the key-frame selection result of (b) is better.

Algorithm 3 Cross-Validation-Based Parameter Setting
1: for each video $V_i$ in the dataset do
2: Choose $V_i$ as validation data and the others as training data.
3: for $\alpha = \alpha_1$ to $\alpha_M$ do
4: Apply SMRS to the training data.
5: Calculate the entropy over the training data.
6: Average the entropy.
7: end for
8: Draw the entropy curves versus the different values of $\alpha$.
9: Choose the optimal value of $\alpha$.
10: Apply SMRS to the validation data with the optimal $\alpha$.
11: end for
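The accuracy and entropy metrics described in this section can be sketched as follows. The sketch assumes that event labels are given as inclusive (start, end) frame intervals and that the event probabilities are computed over the correctly selected frames only; the frame numbers are hypothetical and are chosen to mimic the two cases discussed for Fig. 6.

```python
import math
from collections import Counter

def accuracy(selected_frames, events):
    """Fraction of selected frames that fall inside any labelled (start, end) interval."""
    if not selected_frames:
        return 0.0
    correct = sum(any(start <= f <= end for start, end in events)
                  for f in selected_frames)
    return correct / len(selected_frames)

def event_entropy(selected_frames, events):
    """Base-2 Shannon entropy of the events hit by the correctly selected frames."""
    hits = Counter()
    for f in selected_frames:
        for idx, (start, end) in enumerate(events):
            if start <= f <= end:
                hits[idx] += 1
                break
    total = sum(hits.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in hits.values())

# Six labelled events with unlabelled gaps between them.
events = [(0, 9), (20, 29), (40, 49), (60, 69), (80, 89), (100, 109)]
case_a = [60, 62, 64, 66, 68, 15]   # five frames in event 4, one outside any event
case_b = [5, 45, 65, 85, 105, 15]   # events 1, 3, 4, 5, and 6, one outside any event

acc_a, acc_b = accuracy(case_a, events), accuracy(case_b, events)
ent_a, ent_b = event_entropy(case_a, events), event_entropy(case_b, events)
```

Both cases have the same accuracy of 5/6, whereas the entropy separates them: it is 0 when all correct frames fall in one event and $\log_2 5 \approx 2.3219$ when they are spread evenly over five events.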

C. CROSS-VALIDATION-BASED PARAMETER SETTINGS
To obtain an appropriate value of the parameter $\alpha$ in SMFE, we use cross-validation. We used all videos from the brownie dataset to determine the optimal parameter. First, we took one brownie dataset video as validation data and the other videos as training data, as illustrated in Fig. 7. Then, we varied $\alpha$ over different values, calculated the entropy of the training data for each value, and averaged the entropy results. From the curves of the average entropy versus the value of $\alpha$, the optimal $\alpha$ is determined as the value that yields the highest entropy. We describe the steps in Algorithm 3.
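The cross-validation procedure can be sketched as a leave-one-out loop. The sketch below is an illustration under stated assumptions: `entropy_for` is a hypothetical callback that would run the key-frame extraction on one video with a given $\alpha$ and return the entropy of the result, and the toy scoring function used here merely stands in for that computation.

```python
def select_alpha(videos, alphas, entropy_for):
    """For each held-out video, choose the alpha with the highest average training entropy.

    entropy_for(video, alpha) is a hypothetical callback returning the entropy
    obtained on one video for one value of alpha.
    """
    chosen = {}
    for held_out in videos:
        training = [v for v in videos if v != held_out]
        # Average the entropy over the training videos for every candidate alpha.
        average = {a: sum(entropy_for(v, a) for v in training) / len(training)
                   for a in alphas}
        chosen[held_out] = max(alphas, key=lambda a: average[a])
    return chosen

# Toy stand-in for the real extraction: the entropy peaks at alpha = 16.
def toy_entropy(video, alpha):
    return 2.0 - abs(alpha - 16) / 16

best_alpha = select_alpha(["B07", "B08", "B24"], [8, 16, 32], toy_entropy)
```

The returned dictionary maps each validation video to the $\alpha$ selected from the remaining videos; the extraction is then applied to the validation video with that value.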

VI. EXPERIMENTAL RESULTS
We applied our proposed algorithms separately to the CMU-MMAC dataset and the daily activities dataset. The experimental results are presented in this section.

A. SPARSE MODEL
First, we conducted experiments using the sparse-model-based key-frame extraction algorithm. We performed experiments on the CMU-MMAC dataset with multiple types of information: video and motion information. To investigate the effects of different values of the regularization parameter $\alpha$ on the quality of the selected representatives, we considered the brownie dataset as political debate videos. We ran our proposed algorithm with $\alpha = 8, 8\sqrt{2}, \ldots$ to investigate the optimal $\alpha$ with respect to the different brownie dataset videos from 07, 08, ..., to 24. Fig. 9 displays the cross-validated entropy for various values of $\alpha$. Then, we selected the optimal $\alpha$ with the highest entropy value. To evaluate the performance of our proposed multi-sensor integration, with the optimal $\alpha = 64$, we compared the entropies and accuracies obtained using multi-sensor information with those obtained using only video or only motion information. Table 1 presents the evaluation results, which show that the performance using multi-sensor information is better than that using a single type of information in most cases.
Then, we also performed experiments on the daily activities dataset with multiple types of information: video and motion information. To obtain the optimal regularization parameter $\alpha$ in terms of the quality of the selected key frames, we ran the proposed algorithm from $\alpha = 4$ to 64 with a multiplicative step of $\sqrt{2}$ to investigate the optimal $\alpha$ with respect to the different subjects from 09, 11, ..., to 17. Each subject has 10 repeated sessions. We averaged the results of each subject and plotted the results in Fig. 8, from which we can choose the optimal $\alpha$ with the highest entropy value.
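The parameter sweep with a multiplicative step of $\sqrt{2}$ can be generated as in the following minimal sketch. The function name `alpha_grid` is hypothetical; the values are computed as powers of two with half-integer exponents so that the end point 64 is reached without accumulated rounding error.

```python
def alpha_grid(start=4.0, stop=64.0):
    """Values start, start*sqrt(2), start*2, ... up to and including stop."""
    grid = []
    k = 0
    # start * 2**(k/2) equals start * (sqrt(2))**k; the small slack guards the end point.
    while start * 2 ** (k / 2) <= stop * (1 + 1e-12):
        grid.append(start * 2 ** (k / 2))
        k += 1
    return grid

alphas = alpha_grid()
```

For the range from 4 to 64, this yields nine candidate values, namely 4, $4\sqrt{2}$, 8, $8\sqrt{2}$, 16, $16\sqrt{2}$, 32, $32\sqrt{2}$, and 64.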
With the optimal α, we compared the entropies and accuracies using multiple types of information and pure video and motion information. Table 3 displays the evaluation results, from which we can observe that multiple types of information achieve better results. Thus, our proposed multi-sensor integration achieved better performance.

B. GRAPH MODEL
We also applied our key-frame extraction algorithm based on the graph model to the CMU-MMAC and daily activities datasets. The parameter settings follow those in [50]. As with the sparse model, we first present the results for the CMU-MMAC dataset, for which we performed experiments with multiple types of information: video and motion information. The parameters were set to $\lambda = 1$, $\theta_1 = 20$, $\alpha = 5$, $\gamma = 1$, $\beta = 1$. The remaining parameter $\theta_1$ controlled the number of selected key frames. Table 2 shows the entropies and accuracies obtained using a single type of information and using multiple types of information. The experimental results indicate that using multiple types of information performs better than using only video or only motion information.
Next, we describe the experimental results achieved with the daily activities dataset. We took subject 09 as a representative case. The parameters were set to $\lambda = 1$, $\theta_1 = 20$, $\alpha = 5$, $\gamma = 1$, $\beta = 1$. We adjusted $\theta_1$ to control the number of selected key frames. As shown in Table 4, the experiments with multiple sensors achieved better results.

C. COMPARISON BETWEEN THE TWO MODELS
To compare the performances of SMFE and GMFE, we report their computational times. The algorithms were run on a computer with an Intel Core i7 CPU under the Microsoft Windows 10 operating system. GMFE (700 s on average) requires much less time than SMFE (2550 s on average). If computational time is the principal consideration, then GMFE is a better choice than SMFE.
Then, we calculated the number of key frames selected by our methods for each event in the videos. Let us take brownie 08 as an example. Fig. 10 and Fig. 11 show the results, which indicate that the key frames extracted by the proposed algorithms using multi-sensor information represent the events better than those extracted using only a single type of information.
From the above discussion, the SMFE algorithm has fewer parameters (only one parameter) than the GMFE algorithm. Thus, SMFE is easier to adjust and more robust with respect to different videos. However, GMFE requires less computational time. Thus, GMFE is more efficient in high-dimensional situations.

VII. CONCLUSION
We proposed novel frameworks for key-frame extraction from FPV videos based on sparse modelling and graph modelling with multi-sensor integration. Deep features from a pre-trained DNN, rather than raw video frames, are used for key-frame extraction. The indices of the key frames are then estimated by the proposed algorithms, which are shown to extract more informative key frames from FPV videos. The experimental results indicate that the proposed approaches achieve a modest improvement over using only video data. The accuracy and entropy results demonstrate the effectiveness of the proposed algorithms. Moving forward, we will extend our approach by incorporating other non-video information, including text, audio, electromyograms, and heart rate signals.