Exploring Global Diversity and Local Context for Video Summarization

Video summarization aims to automatically generate a diverse and concise summary, which is useful in large-scale video processing. Most existing methods adopt a self-attention mechanism across video frames, which fails to model the diversity of video frames. To alleviate this problem, we revisit the pairwise similarity measurement in the self-attention mechanism and find that the existing inner-product affinity leads to discriminative rather than diversified features. In light of this phenomenon, we propose global diverse attention, which instead uses the squared Euclidean distance to compute the affinities. Moreover, we model the local contextual information with a novel local contextual attention to remove redundancy in the video. By combining these two attention mechanisms, a video SUMmarization model with a Diversified Contextual Attention scheme, namely SUM-DCA, is developed. Extensive experiments are conducted on benchmark data sets to verify the effectiveness and superiority of SUM-DCA in terms of F-score and rank-based evaluation, without any bells and whistles.


I. INTRODUCTION
With the rise of video-sharing websites (e.g., YouTube and Facebook), the demand for video analysis is surging rapidly. From a content producer's perspective, processing long videos is not a pleasant experience. Under this circumstance, automatic video processing techniques are desperately needed. Video summarization is one such technique for handling massive video data: it removes redundancy by selecting diverse segments from the video as the video summary, and automated methods for generating summaries need to be investigated.
In the past few decades, various approaches [1]-[8] have been proposed to automatically summarize untrimmed videos. Some works [1], [3], [9], [10] leverage Recurrent Neural Networks (RNNs) [11] and Long Short-Term Memory (LSTM) [12] for video summarization by modeling the temporal information and show great success. However, these models fail to handle long videos, since recurrent models cannot capture long-range dependencies across video frames; they tend to suffer serious decay of the history information on long sequences [13]. Recently, attention-based methods [6]-[8], [14] have been proposed to alleviate this problem by directly computing the pairwise matrix over the whole video sequence. However, there are several drawbacks: 1) a pure self-attention mechanism over all video frames cannot model diversified feature representations and is thus not suitable for video summarization; 2) local temporal cues are unexplored for identifying the most representative content in a local context.
For the former bottleneck, previous approaches [8], [15] tend to adopt the self-attention or multi-head attention mechanism to capture the temporal relations over video frames, which can be implemented by constructing a pairwise frame similarity matrix and taking a weighted average over all frames. These methods simply adopt the dot product as the default pairwise similarity measurement, and we argue that it is not proper for the video summarization task. This is because a frame pair with a larger weight magnitude would suppress the representations of the other frames, producing a discriminative feature representation for the whole video. But for video summarization, a good summary should reflect the diversified semantic information of the video, which cannot be satisfied by the dot-product similarity measurement. In consequence, we develop global diverse attention to quantify the importance of each video frame and simultaneously promote the diversity among these frames. Concretely, we find that the choice of the pairwise similarity measurement in the pairwise relation modelling is vital. We use the L2 similarity to substitute the dot product as the similarity measuring function, which leads to more diversified feature representations. Besides, the proposed global diverse attention mechanism shares a similar computation with the pure self-attention mechanism via matrix operations, and can be fully optimized by GPU parallel acceleration.
For the latter bottleneck, most existing methods [6], [15] tend to model the video relations globally, while the local temporal evolution across consecutive frames is not adequately exploited. Inspired by the concise characteristic of video summarization, the most representative information within a short video segment should be identified and extracted to reduce redundancy. For illustration, the beginning and the ending of an event together foreshadow its happening, which should be included in the summary. To this end, we propose a local contextual attention mechanism to identify discriminative features by modelling the local context information. In particular, the pairwise similarities between an anchored frame and its adjacent frames are computed, then the local contextual feature is generated by a weighted aggregation of the adjacent frames. Therefore, the local contextual feature not only includes the representation of the original frame but also integrates the local dependencies among adjacent frames. In a nutshell, by combining global diverse attention and local contextual attention, we formulate a Diversified Contextual Attention (DCA) scheme and propose a model named SUM-DCA to address the above limitations, which we believe are significant signs of progress for video summarization.
The main contributions can be highlighted as follows:
• A diversified contextual attention scheme is developed to model the diversified contextual representation of the video by using the pairwise relations among frames, which enables the model to generate diversified and concise summaries.
• By delicately selecting the pairwise similarity function that influences the magnitude of frame relations, SUM-DCA is able to generate a diverse representation that the conventional self-attention mechanism fails to capture.
• Extensive experiments are conducted on the benchmark data sets. The results demonstrate that our model outperforms other competing approaches on the SumMe and TVSum data sets.

II. RELATED WORK

A. VIDEO SUMMARIZATION
Video summarization has been widely explored in multimedia analysis with great potential, and its approaches can be categorized into two main streams: unsupervised and supervised methods. Our model can be trained in both supervised and unsupervised fashions.

1) Unsupervised Video Summarization
The unsupervised methods mainly focus on designing heuristic criteria to choose the key shots in terms of representativeness, diversity, and relevance [16]-[18].

2) Supervised Video Summarization
In supervised video summarization, recurrent neural networks (RNNs) have been widely adopted in recent years [1], [6], [9]. Zhang et al. [1] use a bi-directional LSTM to model the temporal dependency of video frames and further introduce determinantal point processes to model the diversity of the selected frames. To consider the shot relations within the video, Zhao et al. [9] develop a hierarchical structure-adaptive RNN to model the intra-shot relations. Instead of using an RNN, SUM-FCN [22] uses a 1D fully convolutional neural network to capture the local information of video frames. Besides, Jungji et al. [23] construct a recurrent graph to model the temporal relation between video frames with residual learning. Jiri et al. [15] introduce the self-attention mechanism for modelling the global information of video frames. In addition, Li et al. [6] propose a diverse attention mechanism to capture the global diversity between video frames. Different from previous supervised methods, the proposed model not only captures the local context of the video but also models the global diversity by scrutinizing the self-attention mechanism.

B. VIDEO HIGHLIGHT DETECTION
Video highlight detection [24]-[26], a task related to video summarization, aims to select the most representative segment from an untrimmed video. Video summarization, in contrast, requires the integrity of the whole video and does not solely involve the most representative segments. Various studies have been explored in recent years. Gygli et al. [24] create Video-GIF pairs for ranking the video segments to select the highlight segment. To avoid heavy human annotation, Xiong et al. [27] mine the relation between video duration and video highlights by observing that short videos are more likely to contain highlights. To further exploit the video information, Hong et al. [25] address the video highlight detection problem not only with visual information but also with audio features. Furthermore, Ye et al. [26] propose a low-rank audio-visual fusion scheme that models the temporal dependencies among video segments to better localize the highlight segments. In addition, Badamdorj et al. [28] introduce a noise sentinel to adaptively discount a noisy visual or audio modality during audio-visual fusion.

[Figure 1] Overview of SUM-DCA. Given an input video, SUM-DCA first extracts the features of the video frames, then uses the diversified contextual attention scheme to model the diversified contextual features. Finally, it utilizes the score regression module to generate a summary automatically. Note that X^g and X^l are the global diversified features and the local contextual features, respectively. (Modules: Local Contextual Attention, Score Regression, Feature Reconstruction.)

III. THE PROPOSED APPROACH
In this work, we elaborate the SUMmarization model with Diversified Contextual Attention (SUM-DCA) for video summarization. The overview of SUM-DCA is illustrated in Figure 1. Specifically, the proposed diversified contextual attention scheme contains global diverse attention modelling and local contextual attention modelling, which explore the diversified frame representation over all frames and mine the local temporal cues in consecutive frames. Essentially, the global diverse attention models the pairwise relations over the whole video via a pairwise similarity measurement with the negative squared Euclidean distance, which produces diversified frame representations with respect to the whole video. Meanwhile, the local contextual attention is able to recognize the most representative frame within a local region by modelling the local temporal contextual information. Finally, we explain the optimization of SUM-DCA and the details of inference.
[Figure 2] The global diverse attention mechanism. W_Q, W_K, and W_V are trainable projection matrices that project the video features X into different subspaces. The pairwise similarity measurement s(·, ·) quantifies the similarities between different frame pairs, which plays an important role in generating different representations.

A. GLOBAL DIVERSE ATTENTION
Previous methods [1], [9], [10] estimate the frame importance for the video summary directly, without capturing the diversity of the selected frames, which is a pivotal characteristic of a video summary. In consequence, we propose global diverse attention, which exploits the pairwise relations among video frames to encode diversified frame features. Given a video V = {v_1, ..., v_T} with T frames, a pre-trained CNN, e.g. GoogLeNet [29], is utilized to extract the corresponding frame features X ∈ R^{T×d}. As depicted in Figure 2, the pairwise relation matrix A ∈ R^{T×T} that reveals the underlying temporal relations is derived, and each entry A_ij measures the similarity between the i-th and the j-th frame:

A_ij = s(W_Q x_i, W_K x_j) / √q,   (1)

where s(·, ·): R^d × R^d → R is the pairwise similarity measurement, W_Q ∈ R^{d×d} and W_K ∈ R^{d×d} are learnable linear projection matrices, and q is the scaling factor; we set q = d empirically to avoid small gradients during back-propagation. Then, the global diverse attention weights are obtained by applying the softmax normalization:

Ã_ij = exp(A_ij) / Σ_{k=1}^{T} exp(A_ik).   (2)

With the normalized global diverse attention weights Ã, the encoded global diversified features can be computed as a weighted sum of the projected video features with an efficient matrix multiplication:

x^g_i = Σ_{j=1}^{T} Ã_ij W_V x_j,   (3)

where W_V ∈ R^{d×d} is a trainable linear projection parameter. Moreover, we add the positional information P ∈ R^{T×d} to the video sequence X before applying the global diverse attention scheme in order to preserve the temporal order. In detail, P is added to X before calculating the pairwise similarities A. Following [2], we use the sinusoidal positional embedding defined as:

P_{t,2i} = sin(t / 10000^{2i/d}),   (4)
P_{t,2i+1} = cos(t / 10000^{2i/d}).   (5)

The pairwise similarity measurement s(·, ·): R^d × R^d → R is crucial for generating diverse and informative features, as it supports the feature relation modelling for deriving global diverse attention. One of the common choices is the dot product, i.e.,
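To make the computation above concrete, the following is a minimal NumPy sketch of a global diverse attention pass; the function names, the exact point at which the positional embedding is added, and the numerically stabilized softmax are our assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Sinusoidal positional embedding P in R^{T x d} (d assumed even)."""
    pos = np.arange(T)[:, None]                   # frame indices t
    i = np.arange(d // 2)[None, :]                # dimension indices
    angles = pos / np.power(10000.0, 2.0 * i / d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)                   # even dimensions
    P[:, 1::2] = np.cos(angles)                   # odd dimensions
    return P

def global_diverse_attention(X, Wq, Wk, Wv):
    """Global diverse attention with negative squared Euclidean similarity."""
    T, d = X.shape
    Xp = X + sinusoidal_pe(T, d)                  # add positional info to X
    Q, K = Xp @ Wq, Xp @ Wk
    # A_ij = -||Q_i - K_j||^2 / sqrt(q), computed in matrix form
    # via the identity 2*Q@K.T - ||Q_i||^2 - ||K_j||^2, with q = d
    A = (2 * Q @ K.T
         - (Q ** 2).sum(1)[:, None]
         - (K ** 2).sum(1)[None, :]) / np.sqrt(d)
    A = np.exp(A - A.max(axis=1, keepdims=True))  # stable softmax ...
    A_tilde = A / A.sum(axis=1, keepdims=True)    # ... over each row
    return A_tilde @ (Xp @ Wv)                    # global diversified features
```

Because every step is a dense matrix operation, the whole pass maps directly onto GPU-parallel kernels, as the text notes.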
s(u, v) = uᵀv, which makes our global diverse attention mechanism equivalent to the standard self-attention mechanism [2]. However, if we inspect the aggregation procedure in Eq. (3), it can be observed that a frame pair with a larger attention weight Ã_ij would predominantly suppress the representations of the other frame pairs. This is sensible in some tasks, e.g., video classification, since pixels that involve objects contain more information than meaningless background pixels, so the pair with the larger magnitude should have a higher impact on the feature representation. But video summarization aims to generate a comprehensive and diverse collection of video segments rather than only the single most representative segment. To handle this problem, we choose the L2 similarity for s(u, v), which is defined as:

s(u, v) = -||u - v||_2^2,   (6)

where ||·||_2 is the L2 norm. A naive implementation of the L2 similarity involves an O(T^2) loop, which is much slower than the dot product that can be implemented directly with matrix multiplication. To accelerate the computation and fully utilize the parallelism of GPU hardware, we decompose Eq. (6) as:

s(u, v) = 2uᵀv - uᵀu - vᵀv,   (7)

which shares a similar computation with the dot product and can be easily implemented with matrix operations.
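The decomposition can be checked numerically; this small NumPy snippet (all names are illustrative) compares the naive O(T²) loop against the matrix form:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))   # projected "queries" W_Q x_i
K = rng.standard_normal((5, 4))   # projected "keys"    W_K x_j

# Naive O(T^2) loop: s(u, v) = -||u - v||_2^2 for every frame pair
naive = np.array([[-np.sum((u - v) ** 2) for v in K] for u in Q])

# Matrix form: -||u - v||^2 = 2 u^T v - u^T u - v^T v
fast = 2 * Q @ K.T - (Q ** 2).sum(1)[:, None] - (K ** 2).sum(1)[None, :]

assert np.allclose(naive, fast)   # same affinities, one matmul instead of a loop
```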
To give an intuition as to why the L2 similarity leads to more diversified feature representations than the dot product, we visualize a simple but intuitive case in a 2-D feature space. As illustrated in Figure 3, we randomly generate three feature points in the 2-D space and compute the softmax contributions (i.e., the global diverse attention weights Ã) of these three points with respect to a reference point in the space. With the dot-product measurement, it can be observed that the blue point dominates the representation of the reference point, which leads to a more discriminative representation. Under the L2 similarity measurement, on the other hand, each point has the chance to contribute to the feature representation of the reference point, resulting in a more diversified representation over all points.
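The effect can be reproduced without the figure. In the toy snippet below (the points are chosen by us purely for illustration), one large-magnitude point dominates the softmax under the dot product, while the L2 affinity spreads the contributions more evenly:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

points = np.array([[3.0, 0.5],    # one large-magnitude point
                   [0.2, 0.4],
                   [0.3, -0.2]])
ref = np.array([1.0, 0.2])        # reference point

w_dot = softmax(points @ ref)                          # dot-product affinities
w_l2 = softmax(-np.sum((points - ref) ** 2, axis=1))   # L2 affinities

# The dot product concentrates almost all weight on the large point,
# whereas L2 leaves every point a meaningful contribution.
print(w_dot, w_l2)
```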

B. LOCAL CONTEXTUAL ATTENTION
Although global diverse attention models the frame-wise diversity within a video, the temporal cues in the video are ignored. Intuitively, a segment in the summary often contains the most representative content of its temporal context, i.e., the adjacent video frames around the selected summary frames. Therefore, we argue that temporal contextual information is also pivotal for video summarization. We propose the local contextual attention mechanism to integrate local information over consecutive frames. The proposed mechanism is able to recognize the most informative frame among similar adjacent frames as a summary candidate by capturing the temporal contextual cues, thus avoiding redundancy in the generated summary.

[Figure 4] The illustration of the local contextual attention mechanism. It aggregates the discriminative frames and mines the temporal cues within the local window.

As shown in Figure 4, for each anchored frame x_h ∈ R^d, where h ∈ {1, 2, ..., T}, we restrict its attention region to a local scope with its 2R adjacent frames:

N_h = {x_{h-R}, ..., x_h, ..., x_{h+R}}.   (8)

Then, the local contextual pairwise matrix B^h ∈ R^{(2R+1)×(2R+1)} can be computed through two linear projections, and each entry B^h_ij can be computed as:

B^h_ij = s(W_Q x_{h-R+i}, W_K x_{h-R+j} + r_{j-i}) / √q,   (9)

where i, j ∈ {0, 1, ..., 2R}, and the r_{j-i} ∈ R^d are trainable relative positional embedding vectors that can attend to relative distances within the local window [8].

Next, the local contextual attention weights B̃^h_ij within the local window centered at position h are calculated as:

B̃^h_ij = exp(B^h_ij) / Σ_{k=0}^{2R} exp(B^h_ik).   (10)

To capture the local contextual information anchored at position h, modelled by the local contextual attention weight matrix B̃^h ∈ R^{(2R+1)×(2R+1)}, we aggregate the linearly projected features within the window around x_h as:

x^l_h = Σ_{j=0}^{2R} B̃^h_{Rj} W_V x_{h-R+j},   (11)

where the linear projection matrix W_V ∈ R^{d×d} is a parameter to be learned, and x^l_h ∈ R^d is the weighted vector reflecting the local context of the h-th frame.
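As a rough sketch (not the authors' implementation: the relative positional embeddings are omitted and the window is simply clipped at the video boundaries), the local contextual attention can be written as:

```python
import numpy as np

def local_contextual_attention(X, Wq, Wk, Wv, R=2):
    """For each anchored frame, attend only to its 2R adjacent frames."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Xl = np.zeros_like(X)
    for h in range(T):
        lo, hi = max(0, h - R), min(T, h + R + 1)  # clip window at boundaries
        b = Q[h] @ K[lo:hi].T / np.sqrt(d)         # anchor's row of affinities
        b = np.exp(b - b.max())
        b /= b.sum()                               # softmax within the window
        Xl[h] = b @ V[lo:hi]                       # local contextual feature
    return Xl
```

A batched implementation would vectorize the loop over anchors, but the per-anchor form above mirrors Eqs. (9)-(11) most directly.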

C. THE SUM-DCA MODEL
By incorporating the global diverse attention mechanism and the local contextual attention mechanism, we obtain the global diversified features X^g ∈ R^{T×d} and the local contextual features X^l ∈ R^{T×d}. Then, these two types of features are combined with the original frame representations X as:

X̂ = X + X^g + X^l,   (12)

where X̂ denotes the diversified contextual features of the video V. The diversified contextual features are then handled by the score regression function y(·) and the embedding function φ(·). In detail, the score regression function is implemented by two fully-connected layers with a ReLU [30] activation function and a sigmoid function respectively, which outputs the frame importance scores y ∈ R^T.
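A minimal sketch of this fusion and score regression step follows; the additive fusion of the three feature streams and all weight shapes here are our assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_regression(X_hat, W1, b1, W2, b2):
    """Two fully-connected layers (ReLU, then sigmoid) -> frame scores y."""
    h = np.maximum(0.0, X_hat @ W1 + b1)   # FC + ReLU
    return sigmoid(h @ W2 + b2).ravel()    # FC + sigmoid, scores in (0, 1)

# Fuse the three feature streams (simple addition is assumed here)
T, d = 6, 8
rng = np.random.default_rng(1)
X, Xg, Xl = (rng.standard_normal((T, d)) for _ in range(3))
X_hat = X + Xg + Xl
y = score_regression(X_hat,
                     0.1 * rng.standard_normal((d, d)), np.zeros(d),
                     0.1 * rng.standard_normal((d, 1)), np.zeros(1))
```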
For training the SUM-DCA model, we employ three loss functions, i.e., the classification, repelling, and reconstruction losses. Our model can be trained in both supervised and unsupervised manners. Specifically, the supervised setting uses all three losses, while the unsupervised setting only uses the repelling loss and the reconstruction loss.

a: Classification Loss
We use the binary cross-entropy loss for classifying each frame, which is defined as:

L_cls = -(1/T) Σ_{i=1}^{T} [ŷ_i log y_i + (1 - ŷ_i) log(1 - y_i)],   (13)

where ŷ_i is the ground-truth annotation of the i-th frame, and y_i is the i-th frame importance score.

b: Repelling Loss
In order to further represent the diversity of the video frames, we employ the repelling loss [10] to enhance the diversity of frames, which computes the mean value of the pairwise cosine similarities over all T frames:

L_rep = 1/(T(T-1)) Σ_i Σ_{j≠i} φ(x_i)ᵀφ(x_j) / (||φ(x_i)||_2 ||φ(x_j)||_2),   (14)

where φ(x_i) ∈ R^d is the embedding vector of the i-th frame.
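The mean pairwise cosine similarity described above can be sketched in a few lines of NumPy (function and argument names are ours):

```python
import numpy as np

def repelling_loss(E, eps=1e-8):
    """Mean pairwise cosine similarity over all distinct frame pairs.

    E is the (T, d) matrix of frame embeddings phi(x_i); minimizing this
    value pushes the embeddings apart, i.e. promotes diversity."""
    T = E.shape[0]
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + eps)  # unit-normalize
    S = E @ E.T                                               # cosine matrix
    off_diag = S.sum() - np.trace(S)                          # exclude i == j
    return off_diag / (T * (T - 1))
```

Identical embeddings give a loss near 1, mutually orthogonal embeddings a loss of 0, which matches the intent of the regularizer.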

c: Reconstruction Loss
A good summary should contain the main content of the video, which indicates that the summary has a large reconstruction capacity. Therefore, we use the reconstruction loss to reconstruct the features corresponding to the frames:

L_rec = (1/T) Σ_{i=1}^{T} ||ϕ(x̂_i) - x_i||_2^2,   (15)

where ϕ(·) is the reconstruction function, implemented by a two-layer fully-connected network with a Sigmoid activation function. Now, we can obtain the final loss for SUM-DCA in the supervised setting as follows:

L = L_cls + α L_rep + β L_rec,   (16)

where α and β are the hyperparameters controlling the trade-off among the three losses. Besides, we also modify the loss to extend SUM-DCA to the unsupervised scenario, i.e., SUM-DCA_unsup, by omitting the classification loss L_cls:

L_unsup = α L_rep + β L_rec.   (17)

During the training stage, the above loss functions are optimized iteratively; the training procedure is detailed in Algorithm 1.

Algorithm 1 Training of SUM-DCA
Output: Learned model parameters Θ.
1: Initialize all parameters, denoted by Θ, using Xavier initialization.
2: Extract frame-level features X_m ∈ R^{T×d} for all videos.
3: Use X_m to calculate the global diversified features X^g using Eqs. (1)-(3).
...
8: Calculate the frame scores y_i by the score regression y(·).
...
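The loss assembly can be sketched as follows; the binary cross-entropy term follows the standard definition, while keeping α and β as the weights in the unsupervised variant is our assumption.

```python
import numpy as np

def bce(y, y_gt, eps=1e-8):
    """Classification loss: binary cross-entropy over frame scores."""
    y = np.clip(y, eps, 1 - eps)        # avoid log(0)
    return float(-np.mean(y_gt * np.log(y) + (1 - y_gt) * np.log(1 - y)))

def sum_dca_loss(y, y_gt, L_rep, L_rec, alpha, beta, supervised=True):
    """Supervised: L_cls + alpha*L_rep + beta*L_rec; the unsupervised
    variant simply drops the classification term."""
    L = alpha * L_rep + beta * L_rec
    if supervised:
        L += bce(y, y_gt)
    return L
```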

D. SUMMARY GENERATION
For summary generation, a set of key shots is selected by maximizing the frame scores. Specifically, we follow [1], [21] to generate a set of change points using Kernel Temporal Segmentation (KTS) [31], thereby dividing the video into S shots in total. A summary length budget l ≤ T is applied to control the length of the generated summary. Then, the key shots are selected by the 0/1 Knapsack algorithm [32], which is formulated as:

max_{p_1,...,p_S} Σ_{i=1}^{S} p_i s_i,  s.t.  Σ_{i=1}^{S} p_i l_i ≤ l,  p_i ∈ {0, 1},   (18)

Algorithm 2 Video Summary Generation

Input:
Test video V and model parameters Θ.
Output: Video summary S.
1: ...
2: Extract the frame-level features by a pre-trained CNN model.
3: Use KTS [31] to divide the video V into S shots {S_i}_{i=1}^{S}.
4: ...
5: Calculate the local contextual features X^l by Eqs. (9)-(11).
6: Get the diversified contextual features via Eq. (12).
7: Obtain the frame scores y_i for each diversified contextual feature with the frame score regression module y(·).
8: Solve the optimal p_i for each shot S_i through Eq. (18).
9: for all shots S_i in the video V do
10: S ← S ∪ {S_i} if p_i = 1
11: ...
12: end for
13: return S
where s_i indicates the average score over the frames within the i-th shot generated by KTS, and l_i is the length of the i-th shot. If p_i = 1, the i-th shot is chosen to compose the summary. The video summary generation steps are summarized in Algorithm 2.
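A standard dynamic-programming solution to this 0/1 knapsack selection (a generic sketch, not the authors' code; lengths and the budget are in whole frames) is:

```python
def knapsack_select(scores, lengths, budget):
    """Pick shots maximizing total score subject to a total-length budget.

    scores[i] is the average frame score of shot i, lengths[i] its length;
    returns the indices of the selected shots (p_i = 1)."""
    S = len(scores)
    dp = [0.0] * (budget + 1)                  # dp[c]: best score at capacity c
    keep = [[False] * S for _ in range(budget + 1)]
    for i in range(S):
        # iterate capacities downwards so each shot is used at most once
        for c in range(budget, lengths[i] - 1, -1):
            cand = dp[c - lengths[i]] + scores[i]
            if cand > dp[c]:
                dp[c] = cand
                keep[c] = keep[c - lengths[i]][:]
                keep[c][i] = True
    best_c = max(range(budget + 1), key=lambda c: dp[c])
    return [i for i in range(S) if keep[best_c][i]]
```

For example, with shot scores [0.9, 0.2, 0.7], lengths [4, 3, 5], and a budget of 9 frames, the selection is shots 0 and 2 (total score 1.6 at exactly length 9).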

IV. EXPERIMENTS

A. DATASETS
We employ four data sets in this paper: SumMe [33], TVSum [34], Open Video Project (OVP) [19], and YouTube [19]. TVSum consists of 50 videos of 1-5 minutes in length covering various topics, such as news, documentaries, and how-to videos. The SumMe data set is a collection of 25 user videos recording different events, including holidays, history, and sports; the length of the videos in SumMe varies from 1.5 minutes to 6.5 minutes. For the YouTube and OVP data sets, 39 and 50 videos are collected with cartoon, news, and sports topics, respectively. These data sets are diverse in terms of content and come with different types of annotations, i.e., shot-level scores for SumMe and frame-level scores for TVSum. We use the SumMe and TVSum data sets for training.

[Table fragment, rank-based scores]
Method | Kendall's τ | Spearman's ρ
[3] | 0.020 | 0.026
HSA-RNN [9] | 0.082 | 0.088
VASNet [2] | 0.082 | 0.088
SUM-GAN [10] | -0.054 | -0.070
SUM-FCN [22] | 0.011 | 0.014
CSNet [4] | 0.070 | 0.091
HMT [8] | (truncated)

E. QUANTITATIVE RESULTS
We compare the proposed SUM-DCA with several state-of-the-art methods, including both supervised and unsupervised approaches, in terms of the three settings described in Table 1.
Table 2 summarizes the experimental results of the different supervised approaches. Our model achieves the best performance on the SumMe data set in all three settings, while obtaining the best generalization performance in the Transfer setting on both data sets. In particular, our method yields performance at least 6.1% higher than the RNN-based methods (i.e., Bi-LSTM [1], DPP-LSTM [1], DR-DSN sup [3], HSA-RNN [9], and CSNet unsup [4]) on SumMe under the canonical setting, and 9.0% higher under the augmented setting. This is because the RNN-based methods fail to model long-term dependencies and thus cannot capture the long-term context needed to summarize the video effectively. Besides, we observe that SUM-DCA outperforms the attention-based methods (i.e., M-AVS [14], VASNet [15], HMT [8], SUM-GDA [6], and HMANet [7]) by a large margin due to the proper choice of the pairwise similarity measurement s(·, ·). This demonstrates that the attention mechanism is crucial for modelling the global information of the video for summarization. For example, M-AVS [14] utilizes an encoder-decoder structure with additive attention to measure the similarity, and VASNet [15] adopts the pure self-attention mechanism to encode the global similarity information. These approaches, however, solely model the similarity among video frames, resulting in discriminative video features. SUM-GDA [6] models the global dissimilarities of the video frames instead of the similarity, and achieves a relative gain of 3.1% on SumMe under the canonical setting. Our proposed SUM-DCA not only yields diversified frame features but also captures the local context among several frames, therefore leading to higher performance than SUM-GDA in all aspects. Table 3 presents the experimental results of the unsupervised methods. It can be observed that our method achieves comparable performance among the competing unsupervised approaches. Typically, compared to the
GAN-based methods (i.e., SUM-GAN [10], Cycle-SUM [5], and UnpairedVSN [21]), our pure attention model achieves a relative gain of 2.9% on the SumMe data set and 1.6% on TVSum under the canonical setting. In addition, we also notice that the difference in performance over the three settings is relatively smaller on TVSum than on SumMe for all of the methods. This might be due to the fact that SumMe is more challenging and adopts the highest F-score among several users, which makes the evaluation more targeted, while TVSum adopts the average F-score among several users, and the users are unlikely to reach consistent agreement.
Moreover, we evaluate the summarization performance with the rank-based evaluation, which computes the correlation between the predicted probabilities and the human-annotated importance scores. Two rank-based metrics are employed in this paper, i.e., Kendall's τ and Spearman's ρ. The results are summarized in Table 4. As we can observe in the table, the performances of random selection and human annotation are the lowest and the highest, respectively. In particular, our SUM-DCA surpasses the other state-of-the-art methods by a significant margin. Besides, with the help of annotations, SUM-DCA performs better than SUM-DCA unsup in terms of both τ and ρ. Overall, the results in Table 4 indicate the advantages of the proposed SUM-DCA in the following aspects: 1) the proposed global diverse attention mechanism can capture the global dependencies among frames while modelling the diversified frame representation; 2) the local contextual attention is able to integrate the local information over consecutive frames, which is useful for avoiding duplication during summary generation.
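For reference, the two rank statistics can be computed as follows. This self-contained sketch uses Kendall's tau-a and the tie-free Spearman formula on a toy score pair of our own making; library implementations (e.g., scipy.stats) handle ties more carefully.

```python
def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n, c, d = len(a), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            c += s > 0          # concordant pair
            d += s < 0          # discordant pair
    return (c - d) / (n * (n - 1) / 2)

def spearman_rho(a, b):
    """Spearman's rho via rank differences (assumes no ties)."""
    rank = lambda x: [sorted(x).index(v) + 1 for v in x]
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

pred = [0.9, 0.1, 0.4, 0.8, 0.2]    # predicted importance scores (toy)
human = [0.8, 0.2, 0.5, 0.9, 0.1]   # human-annotated reference (toy)
print(kendall_tau(pred, human), spearman_rho(pred, human))  # prints: 0.6 0.8
```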

F. ABLATION STUDIES
In this section, we perform ablation studies to verify the effectiveness of each component of our model under different conditions.

1) Impact of Global Contextual Attention Scheme
Firstly, to examine the influence of the global contextual attention scheme, we conduct an ablation on it, and Table 5 tabulates the main results. There is a significant performance drop when the global diverse attention is not used, which indicates that the diversified frame representation is essential for generating a satisfying summary. Besides, the local contextual attention alone (Exp No.3) is also vital for summarization, since it utilizes the local temporal cues and thus rules out the redundancy among adjacent frames. Moreover, by applying the full global contextual attention (Exp No.4), our method achieves the highest performance on both data sets.
2) Impact of Pairwise Similarity Measurement s(·, ·)
Secondly, we verify that the L2 similarity is more suitable for s(·, ·). The results in Table 6 demonstrate that the L2 similarity improves the performance with negligible overhead on both SumMe and TVSum. In particular, the cosine similarity performs better than the plain dot product, since the normalization of the vectors restrains the model from generating overly discriminative representations to some extent. In addition, the L2 similarity outperforms the other two similarity measurements by at least 2.8% on SumMe and 0.6% on TVSum. This is consistent with Section III-A: the L2 similarity leads to more diversified feature representations, thus boosting the summarization performance.
3) Impact of Neighbor Size R
Then, we investigate the effect of varying the neighbor size R of the local contextual attention, as described in Table 7. Small values of R (i.e., R = 2 or R = 3) yield the best performance. This suggests that local contextual attention within a small region finds the most discriminative features among the similar adjacent frames, fully utilizing the local temporal cues and avoiding redundancy during summary generation. For large R, the model may capture excessive context, which hinders it from learning discriminative frame features among similar frames and thus degrades the summary performance.

4) Impact of Loss Terms
Furthermore, we ablate the contributions of the individual loss terms, and the performances are shown in Table 8. As can be seen from the table, for the supervised setting, the classification loss L_cls alone (Exp No.1) yields the lowest F-scores. Adding the repelling loss (Exp No.4) or the reconstruction loss (Exp No.5) improves performance significantly, while combining all three losses (Exp No.7) leads to the best performance and surpasses the model with L_cls only by at least 2.6% on SumMe and 3.7% on TVSum. However, in the unsupervised setting, the ground-truth annotations cannot be used and the classification loss is removed, which causes a significant degradation in performance compared with the models in the supervised setting.

G. QUALITATIVE STUDY
To further evaluate our model intuitively, we visualize some video summaries and the probability curves predicted by SUM-DCA in Figure 5. As we can observe in the figure, the predicted curves are consistent with the human-annotated score curves. In addition, the rank statistics τ and ρ indicate that our model can perform as well as humans. Figure 5a shows news about traffic accidents, where the selected key shots perfectly summarize the whole news without any redundancy. In Figure 5b, the topic of the video is daily life; our model selects diverse shots from the video, showing the effectiveness of the diversified contextual attention scheme.

H. DISCUSSION
Our method models the contextual information along the video sequence and the diversity among video frames. However, the pairwise similarity measurement is still hand-crafted and cannot be learned during model optimization. Besides, the partition of video shots still needs to be improved, since the quality of the video shots is also important for summarization performance. In future work, we aim to improve the model by incorporating audio information, since we currently only utilize the visual features of the video.

V. CONCLUSION
This paper has proposed a novel video summarization model named SUM-DCA with a diversified contextual attention scheme, which exploits not only global diversity but also local contextual information among video frames. To explore the global diversity, the L2 similarity measurement is adopted, which is superior to the dot-product similarity. Moreover, we utilize the local temporal cue to find discriminative features through local contextual attention. To prove the effectiveness of the proposed SUM-DCA, we conduct comprehensive experiments as well as ablation studies on two publicly available data sets. Empirical results have verified that both SUM-DCA and its unsupervised variant SUM-DCA_unsup perform better than other state-of-the-art methods.
utilizes k-means to group visually similar frames into several clusters based on color features. Mei et al. [20] treat the summarization problem as an L_{2,0}-constrained sparse dictionary selection problem and propose the simultaneous orthogonal matching pursuit (SOMP) algorithm. Recently, deep learning approaches have shown great power for unsupervised video summarization. Mahasseni et al. [10] put forward an adversarial LSTM network that generates the summary with a summary discriminator. Moreover, Yuan et al. [5] add a cycle-consistency constraint to it to sufficiently align the video and its summary, resulting in comparable summarization performance. Besides, Zhou et al. [3] formulate the summarization problem in a reinforcement learning framework with a diversity reward. Rochan et al. [21] leverage pairs of videos to learn the summarization model. Different from these unsupervised methods, we use a simple but intuitive attention model to extract diversified contextual representations from the video. The pairwise similarities between video frame pairs and the reconstruction process are used to construct the objective functions for training. The optimization of our method is more efficient since it requires neither adversarial training nor reinforcement learning.
(a) Dot product similarity. (b) L2 similarity.

FIGURE 3: The visualization of softmax contributions (i.e., attention weights Ã) from three points. The space is colored according to the most similar point under different pairwise similarity measurements.
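The contrast visualized in Figure 3 can be reproduced numerically. The sketch below (our own minimal example, not the paper's implementation) computes row-wise softmax attention weights under both measures: under the dot product, attention from a point is pulled toward the large-magnitude point, while under the negative squared Euclidean distance it concentrates on genuinely nearby points.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(X, measure="l2"):
    # Pairwise affinity matrix: inner product vs. negative squared
    # Euclidean distance, followed by a row-wise softmax.
    if measure == "dot":
        S = X @ X.T
    else:
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        S = -d2
    return softmax(S)

# Three points: two close together, one far away with large magnitude.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
w_dot = attention_weights(X, "dot")
w_l2 = attention_weights(X, "l2")
# From point 1, dot-product attention is dominated by the distant
# large-magnitude point 2, while L2 attention stays near point 1 itself.
print(np.argmax(w_dot[1]), np.argmax(w_l2[1]))  # 2 1
```

This is exactly the discriminative-versus-diversified behavior the paper attributes to the two similarity measures: inner-product affinity rewards feature magnitude, whereas the L2-based affinity rewards closeness.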


TABLE 1: Three different evaluation settings for the TVSum dataset. To evaluate on the SumMe dataset, the positions of SumMe and TVSum should be switched.

TABLE 4: Performance comparison of the rank statistics τ and ρ among different approaches. This experiment uses the TVSum data set under the canonical setting.

TABLE 5: Ablation on the Global Diverse Attention (GDA) and Local Contextual Attention (LCA) mechanisms in our SUM-DCA model. This experiment uses the SumMe and TVSum data sets under the canonical setting.

TABLE 6: Ablation on the pairwise similarity measurement s(·, ·) in the global diverse attention mechanism. This experiment is conducted on the SumMe and TVSum data sets under the canonical setting.

TABLE 7: Variations in performance (F-score, %) when changing the neighbor size R on SumMe and TVSum.

TABLE 8: Variations in performance (F-score, %) when training SUM-DCA with different losses on SumMe and TVSum.