Video Synopsis Based on Attention Mechanism and Local Transparent Processing

The growing number of video cameras has led to an explosive increase in the amount of captured video, especially from the millions of surveillance cameras that operate 24 hours a day. Browsing and retrieving such video is time consuming; video synopsis is one of the most effective techniques for browsing and indexing it, enabling hours of video to be reviewed in just minutes. However, generating a video synopsis that preserves the essential activities of the original video remains costly, labor-intensive, and time-consuming. This paper proposes an approach to generating video synopsis with complete foregrounds and clearer trajectories of moving objects. First, a one-stage CNN-based object detector is employed for object extraction and classification. Then, an attention-RetinaNet is integrated with the proposed Local Transparency-Handling Collision (LTHC) algorithm, which optimizes trajectory combination and makes the trajectories of moving objects clearer. Finally, experiments show that the useful video information is fully retained in the resulting video; the detection accuracy is improved by 4.87% and the compression ratio reaches 4.94, although the reduction in detection time is not significant.


I. INTRODUCTION
Video synopsis is a task with important research value and practical significance. With the spread of 24/7 video surveillance systems, the amount of surveillance video data is growing sharply, and browsing and retrieving video efficiently has become a challenging task. In road or pedestrian monitoring in particular, certain moving objects need to be identified and analyzed in very long videos. The traditional approach is to browse the video by manually controlling playback, which costs too much time; moreover, important information may be missed through human error. Original videos contain a large number of ''static'' frames that show only the background, without any moving objects. In a sense, static frames lack useful information such as object motion trajectories. It is therefore necessary to design a method that shortens the time spent browsing large numbers of ''static'' frames while completely retaining the useful information in the video. Hence the technology of video synopsis was born. (The associate editor coordinating the review of this manuscript and approving it for publication was Shouguang Wang.)
The earliest video synopsis technologies, such as the storyboard [1], scene graph [2], and video manga [3], are similar to video ''skimming'' or fast forward: they select the ''key frames'' that contain the most information in the video and recombine them into a synopsis video in chronological order. The main drawback of these methods is that they cannot keep the complete trajectory of a moving object, so the resulting synopsis video is not conducive to browsing. With the advance of video processing technology, object-based video synopsis [4] has emerged, which focuses on the motion of objects in the video. It uses the ''activity'' as the processing unit, whereas ''key frame''-based video synopsis uses the ''key frame''. Activities can be shifted in the temporal domain, and many activities can be shown in the same frame, so object-based video synopsis can produce a shorter synopsis video while completely retaining the motion information of the original video. The critical process of object-based video synopsis consists of four steps: object detection and tracking, activity rearrangement, background generation, and foreground pasting [5], as shown in Fig. 1. Object detection and tracking are used to extract activities from the original video, which is the premise of video synopsis; the integrity of moving object extraction and the quality of trajectory tracking determine the quality of the synopsis video. Traditional object detection algorithms such as [6]-[8] focus on feature extraction and detect objects based on hand-designed features, but they are not robust to illumination changes and camera jitter, so the extracted foreground may be incomplete. (This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.)
In addition, the objects to be detected are complex and changeable, so it is difficult to detect them through general abstract features. Compared with the above methods, deep learning can extract objects more completely because it is able to learn robust, high-level feature representations of an image [9]. However, to suit the task of video synopsis and generate the synopsis video in a short time, the detection speed should be improved as much as possible while ensuring detection accuracy. The most challenging part of activity rearrangement is collision processing, that is, how to deal with trajectory crossings between two activities. Collisions lead to chaotic scenes in the synopsis video, which reduce the user's viewing comfort. Moreover, when different categories of moving objects are shown in the same frame of the synopsis video, visual confusion arises. The goal of activity rearrangement is to eliminate the impact of collisions on the synopsis video while displaying as many activities as possible. Existing activity rearrangement work employs energy function optimization [10] to guide the rearrangement and achieves good compression of the synopsis video, but at a high cost in computing resources and time.
Convolutional neural networks (CNNs) are a technique that learns features from raw image data in a task-oriented process. They can be used for different tasks, including image classification, object detection, and object tracking, and they work well because of the high-level features they obtain [11]. In this paper, an approach is proposed to enhance the quality of video synopsis. The method consists of the following parts: First, we apply the ''attention mechanism'' [12] to a CNN to improve object detection and make it suitable for video synopsis. Second, we build a dataset to train our object detection model in order to realize classified video synopsis. Third, we propose the LTHC algorithm to reduce computing resource consumption.
The main contributions of this paper are summarized as follows:
• A method is proposed for the video synopsis task that combines a one-stage CNN-based object detector with local transparent processing. Our approach improves the quality of video synopsis.
• By using a CNN-based object detection network, the labels generated by the detector can be reused, and a method of classified video synopsis is shown to be feasible.
• The accuracy of the CNN-based detection method is improved by using an attention mechanism, which raises the precision of the detector without adding too many training parameters.
The rest of this paper is organized as follows. Section II reviews related work on video synopsis and object detection. The proposed approach to generating video synopsis with complete foregrounds and clear trajectories is presented in Section III. Section IV evaluates and analyzes the proposed methods, covering the experimental settings, baseline methods, and experimental results. Finally, concluding remarks and future work are given in Section V.

II. RELATED WORKS

A. OBJECT-BASED VIDEO SYNOPSIS
Video synopsis is the simultaneous presentation of events that enables the review of hours of video footage in just minutes; in other words, it shows all the moving objects of the original video in a short time. According to the form of the result, video synopsis can be divided into key-frame-based and object-based synopsis. This paper focuses on object-based video synopsis, which is also known as dynamic video synopsis.
Object-based video synopsis is an activity-based video condensation technique whose aim is to display as many activities as possible simultaneously in the shortest time period [5]. Its main feature is that activities from different periods can be shifted into the same frame by analyzing the object motion; thus an efficient condensation performance is achieved.
Object-based video synopsis was proposed by Rav-Acha et al. [13] and Pritch et al. [4], [10], [14]. The method consists of an on-line part and an off-line part: the former generates activities and the latter rearranges them. Their study is important because it provides a general framework for video synopsis for the first time, and most subsequent research focuses on the steps presented in Fig. 1. For object detection and tracking, Rav-Acha et al. [13] use pixel differencing with a temporal median, a basic algorithm for extracting the foreground. Studies such as [15]-[18] apply statistical background modeling methods such as GMM to object detection for video synopsis. Mahapatra et al. [19] contribute to the field with a human detection method, which provides more precise results because its false detection ratio is lower. It is worth noting that Jin et al. [20] use R-CNN with the Chi-square distance for video synopsis, introducing CNN-based object detection into the field; a CNN-based detector makes the extracted objects more complete. Inspired by their work, this paper uses one-stage object detection instead of a two-stage method, which results in faster speed and better effect for video synopsis. Optimization is the key to activity rearrangement: it aims to find the best arrangement of the activities in order to display most of them while avoiding collisions as far as possible. An energy function is designed and minimized by an optimization algorithm. Simulated annealing [21] is the most commonly used algorithm and appears in many studies [4], [14], [22]-[25]; mean-shift [15], genetic algorithms [26], and packing-cost-based [13] optimization methods have also been used. These off-line optimization methods are very slow when there are too many activities.
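To make the off-line formulation concrete, the following sketch rearranges activity start times with simulated annealing. The tube durations, the cost weights, and the simplified energy (only a pairwise collision term and a synopsis-length term, far coarser than the full cost of [4]) are all illustrative assumptions, not values from the cited work:

```python
import random, math

# Each activity ("tube") is reduced to its duration in frames; the variable
# being optimized is the start frame assigned to each tube in the synopsis.
durations = [40, 35, 50, 30]           # hypothetical tube lengths
overlap_cost, length_cost = 10.0, 1.0  # toy weights, not the values of [4]

def energy(starts):
    # Synopsis-length term plus a collision term counting frames where two
    # tubes coexist (a stand-in for the pixel-level collision cost).
    e = length_cost * max(s + d for s, d in zip(starts, durations))
    for i in range(len(starts)):
        for j in range(i + 1, len(starts)):
            lo = max(starts[i], starts[j])
            hi = min(starts[i] + durations[i], starts[j] + durations[j])
            e += overlap_cost * max(0, hi - lo)
    return e

def anneal(steps=20000, t0=50.0):
    random.seed(0)
    starts = [0] * len(durations)      # all tubes start together (worst case)
    best, best_e = starts[:], energy(starts)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-3          # linear cooling schedule
        cand = starts[:]
        i = random.randrange(len(cand))
        cand[i] = max(0, cand[i] + random.randint(-5, 5))
        de = energy(cand) - energy(starts)
        if de < 0 or random.random() < math.exp(-de / t):
            starts = cand                         # accept (maybe uphill) move
            if energy(starts) < best_e:
                best, best_e = starts[:], energy(starts)
    return best, best_e

starts, e = anneal()
print(starts, e)
```

The accept-worse-moves step is what lets the method escape local minima; the slowness noted above comes from re-evaluating the energy over all tube pairs at every step, which grows quickly with the number of activities.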
On-line methods rearrange activities while detecting them; this kind of optimization requires less memory and reduces computational complexity. An on-line step-wise optimization has been applied to endless video [27]: it adopts an on-line content-aware approach in a step-wise manner and uses sticky tracking to achieve high-quality visualization at low computational cost. Huang et al. [28] use a table-driven approach, designing a synopsis table that represents each pixel of a video frame with the index of the detected object and frame; the table is updated as objects and frames are detected. The shortcoming of on-line optimization is that it cannot achieve as high a compression ratio as off-line methods.
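A minimal sketch of the table-driven, on-line idea follows. It is deliberately much coarser than the per-pixel synopsis table of [28]: occupancy is tracked per frame at whole-tube granularity, the tube durations are hypothetical, and each arriving tube is greedily given the earliest start frame that keeps the number of simultaneous tubes under a cap:

```python
# Greedy on-line rearrangement: each arriving tube is assigned the earliest
# start frame at which its whole interval stays under the occupancy cap.
def place_tubes(durations, horizon=1000):
    occupied = [0] * horizon          # number of tubes active in each frame
    max_parallel = 2                  # cap on simultaneous tubes (assumed)
    starts = []
    for d in durations:
        for s in range(horizon - d):
            if all(occupied[f] < max_parallel for f in range(s, s + d)):
                starts.append(s)
                for f in range(s, s + d):
                    occupied[f] += 1
                break
    return starts

print(place_tubes([40, 35, 50, 30]))   # [0, 0, 35, 40]
```

Because each tube is placed once and never revisited, the method is fast and needs only the occupancy table in memory, but it cannot reshuffle earlier tubes, which mirrors the compression-ratio limitation of on-line methods noted above.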
In recent years, with more and more researchers joining the study, research on video synopsis has made new progress. Ahmed et al. [29] generate a video synopsis by grouping tracklets of moving objects using spatio-temporal relations. They later extend the method by adding a three-layer CNN that classifies the objects in the video so that a synopsis can be generated according to a user query, realizing classified video synopsis [30]. Some studies focus on object rearrangement and collision handling: Nie et al. [31] handle collisions by changing the speed or size of the objects and develop a Metropolis sampling algorithm to solve the resulting three-variable optimization problem. Ruan et al. [32] propose a dynamic graph coloring algorithm that efficiently rearranges all tubes by determining when they should appear, using on-line optimization. In this paper, we also provide a method that realizes classified video synopsis. The difference from the work in [30] is that we use the labels generated by the CNN-based detector itself to classify objects instead of adding an extra classification network, which reduces both the additional network design work and the computation of the algorithm.

B. CNN-BASED OBJECT DETECTION AND OBJECT SEGMENTATION
Object detection and object segmentation are two main tasks of computer vision; both can be used to identify regions of interest in images or videos. Early object detection methods distinguished the ''foreground'' of interest from the ''background'' of no interest by generating a mask, which is not very different from the current object segmentation task. However, with the development of deep learning, the term ''object detection'' has come to describe techniques that use bounding boxes to locate and classify targets, while ''object segmentation'' means classifying each pixel in an image as belonging to a foreground object or not.
Traditional object detection methods, such as the HOG [33], SIFT [7], and DPM [34] detectors, usually use hand-crafted features to detect objects. With the rise of convolutional neural networks (CNNs), CNN-based object detection has become the mainstream research direction in this field. A CNN is a useful filter that can learn robust, multi-level features of the input image. CNN-based object detection methods fall into two directions: two-stage detection and one-stage detection. The former separates region proposal from object classification, while the latter completes detection in one step [9]. The Regions with CNN features network (R-CNN) [35] is the most representative two-stage algorithm: it first produces object candidate boxes using selective search [36], then extracts features from the boxes with a CNN and classifies them. Under the guidance of this idea, R-CNN developed a variety of improved versions, the best of which is Faster R-CNN [37], whose main contribution is the Region Proposal Network (RPN), a CNN that replaces selective search. Faster R-CNN is the first end-to-end deep learning detector (VOC2007 mAP = 73.2%, VOC2012 mAP = 70.4%), and it breaks through the speed bottleneck of R-CNN. However, two-stage detectors are computationally redundant, which leads to low detection speed.
Different from two-stage detectors, one-stage CNN detection methods process an image in a single pass, so they are faster. YOLO [38] is a typical one-stage method: it uses one network both to propose regions and to compute the probability of each class, so detection speed improves sharply thanks to the reduced computation. However, the increase in speed comes at the expense of accuracy. The Single Shot MultiBox Detector (SSD) [39] introduces multi-reference and multi-resolution detection techniques to improve the accuracy of one-stage detectors. A loss function called the ''focal loss'' and the RetinaNet network were then proposed to solve the class imbalance in the task [40]. Focal loss and RetinaNet give one-stage detectors the same accuracy as two-stage detectors while keeping the high speed of one-stage detection. Therefore, this framework is a suitable component for the video synopsis task. Section III introduces the theory of RetinaNet in detail; we adopt the backbone of RetinaNet and merge it with an attention mechanism to achieve higher accuracy.
Semantic segmentation is a computer vision task at a higher semantic level than object detection: it assigns a class to each pixel of the image. Specifically, the input image is converted into a mask with the regions of interest highlighted. When CNNs are used for semantic segmentation, the encoder-decoder structure is common, and the most classical design is the Fully Convolutional Network (FCN) proposed by Long et al. [41], which uses deconvolution to upsample the features. DeepLab [42] puts forward atrous convolution, and its authors add a Conditional Random Field (CRF) at the end of the network. Techniques from semantic segmentation are also used for more refined object detection; for example, Wang et al. [43] build on the FCN and propose a deep video saliency network consisting of two modules that capture spatial and temporal saliency information to realize video saliency training and detection.
On the basis of semantic segmentation, video segmentation has developed rapidly, and in recent years more and more studies have addressed it. According to the supervision required, conventional video segmentation techniques can be categorized into unsupervised, semi-supervised, and supervised methods. Among them, unsupervised and semi-supervised methods are the hot research directions, as they can segment the objects of a video clip with little guidance information: semi-supervised methods use the object segmentation of the first frame, while unsupervised methods have no prior information. Caelles et al. [44] propose the OSVOS model, which realizes semi-supervised video object segmentation by training a CNN several times on different datasets and combining it with a pixel-wise sigmoid balanced cross-entropy loss. The MaskTrack model, developed by Perazzi et al. [45], is also semi-supervised; it combines off-line and on-line training strategies and uses the temporal information of the video to guide training. Wang et al. [46] introduce ''super-trajectories'' to represent groups of compact point trajectories with similar appearance and close spatiotemporal relationships, which guide the accurate propagation of the first frame's annotations onto the remaining frames. Ventura et al. [47] use a ConvLSTM module and design an encoder-decoder architecture, incorporating recurrence in the spatial and temporal domains, to realize a video object segmentation network for semi-supervised and unsupervised tasks. Song et al. [48] design a pyramid dilated convolution module and a pyramid dilated bidirectional ConvLSTM and cascade them to realize video salient object detection.

III. PROPOSED METHOD
In this section, we describe all the components and methods used in this paper.

A. THE METHOD FOR IMPROVING OBJECT DETECTION ACCURACY FOR VIDEO SYNOPSIS
Extracting the moving objects in the video is the first step of video synopsis. With surveillance cameras, the video inevitably contains jitter and illumination changes [49]. Traditional methods cannot extract objects cleanly because blur and exposure changes in the video affect the results of traditional foreground detectors; Fig. 2(b) shows that they fail to separate moving objects from shadow changes. Currently, CNNs are the most studied algorithms in image analysis. When feature pictures are input, CNNs preserve the spatial relationships in the pictures through filtering operations. A CNN trained on a suitable dataset can detect objects completely and is not affected by illumination changes or shadows. Fig. 2(c) shows the result of object detection using the CNN-based method.
Note that in this paper we use a bounding-box-based object detection method instead of an object segmentation method, for the following reasons. First, video from a single scene has little background change, so when detecting objects with bounding boxes, the residual background pixels inside the boxes add little noise to the synopsis video. Second, based on current research, bounding-box-based object detection has higher accuracy than object segmentation, and using segmentation would lead to incomplete object extraction, degrading the quality of the synopsis video. Third, from the perspective of model training, producing a custom dataset for bounding-box detection is easier than for segmentation, and the computation of the former is smaller than that of the latter.
Based on the above considerations, we use a RetinaNet-based object detection method to extract the foreground. RetinaNet is a one-stage CNN object detection algorithm that uses the ''focal loss'' to alleviate the foreground-background imbalance during training. Although two-stage approaches like R-CNN obtain higher accuracy, their detection speed is low because they use anchor selection to generate candidate regions, so much time is spent generating the synopsis video. RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of state-of-the-art two-stage detectors like Faster R-CNN. The focal loss is defined as:

FL(p_t) = −α (1 − p_t)^β log(p_t)    (1)

In (1), p_t is defined as:

p_t = p if y = 1, and p_t = 1 − p otherwise    (2)

where α, β are hyperparameters of the model. Focal loss addresses the imbalance between positive/negative classes and easy/hard examples by modifying the traditional cross-entropy loss. The authors of focal loss argue that class imbalance is the primary reason one-stage object detectors are less accurate than two-stage methods, and focal loss makes up for this shortcoming [40]. The accuracy of object detection is improved while the speed is guaranteed.
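As a concrete reference, the binary form of the focal loss in (1) can be sketched in a few lines of NumPy, using α = 0.25 and β = 2 (the values adopted later in Section IV); this is an illustrative sketch, not the RetinaNet training code:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, beta=2.0):
    """Binary focal loss of (1): FL(p_t) = -alpha * (1 - p_t)^beta * log(p_t),
    with p_t = p for positives (y = 1) and p_t = 1 - p for negatives."""
    p = np.clip(p, 1e-7, 1 - 1e-7)       # numerical safety for log()
    p_t = np.where(y == 1, p, 1 - p)
    return -alpha * (1 - p_t) ** beta * np.log(p_t)

# The modulating factor (1 - p_t)^beta strongly down-weights well-classified
# examples relative to plain cross-entropy, which is the point of the loss.
ce = -np.log(0.9)                                     # cross-entropy, easy positive
fl = focal_loss(np.array([0.9]), np.array([1]))[0]    # far smaller than ce
print(ce, fl)
```

For an easy positive with p = 0.9, the factor (1 − 0.9)² = 0.01 shrinks the loss by two orders of magnitude, so training is dominated by hard, misclassified anchors rather than the overwhelming number of easy background anchors.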
To further improve the accuracy and avoid missed objects, we incorporate the visual attention mechanism into RetinaNet and name the result ''attention-RetinaNet''. The visual attention mechanism imitates human visual attention: it guides the feature extraction model to focus on the valuable parts of the picture and ignore the less valuable parts. It can be used for a variety of computer vision tasks, such as object detection [50] and video object segmentation [51]. Two main kinds of attention are used for feature extraction: the channel attention mechanism and the spatial attention mechanism. The essence of using attention mechanisms to improve object detection accuracy is that they enable the CNN to focus on the more informative features of the image. In this paper, we use two different attention modules to transform the raw RetinaNet structure: the CBAM module and the SE-Layer.
The Convolutional Block Attention Module (CBAM) [52] exploits both the channel attention mechanism and the spatial attention mechanism to improve feature extraction, and has two sub-modules named the ''channel attention module'' and the ''spatial attention module''. Fig. 3 shows the whole structure and the schema of each module in CBAM. CBAM can be summarized by the following formulas:

F′ = CA(F_in) ⊗ F_in    (3)
F_out = SA(F′) ⊗ F′    (4)

where F_in ∈ R^(H×W×C) is the feature map produced by a feature extraction network such as ResNet [53] or Inception [54], ⊗ denotes element-wise multiplication, CA(·) represents the channel attention module, and SA(·) the spatial attention module. The channel attention module in CBAM emphasizes the importance of max pooling, which the authors combine with average pooling in the CA(·) function:

CA(F) = σ(A_2 δ(A_1 AvgPool(F)) + A_2 δ(A_1 MaxPool(F)))    (5)

where AvgPool(·) denotes average pooling, MaxPool(·) max pooling, δ the ReLU function, σ the Sigmoid function, A_1 ∈ R^((C/r)×C), A_2 ∈ R^(C×(C/r)), and r is the reduction ratio. Then the SA(·) function is applied to the output of the channel attention stage:

SA(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))    (6)

where f^(7×7) represents a convolution with a filter size of 7 × 7. CBAM learns what and where to emphasize or suppress, and refines intermediate features effectively [52]. The Squeeze-and-Excitation (SE) Layer was proposed in [55]. In contrast to CBAM, it uses only the channel attention mechanism. The schema of the SE-Layer is shown in Fig. 4. Its structure is similar to the channel attention module of CBAM, but the SE-Layer uses only average pooling instead of max pooling; the authors of CBAM argue that max pooling gathers information about distinctive object features while average pooling aggregates spatial information [52]. Since the structure is similar, the mathematical expression of the SE-Layer is not repeated here.
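The squeeze-excite-rescale pipeline of the SE-Layer can be sketched in NumPy as follows. The feature-map size, the reduction ratio r = 16 (the value used in Section IV), and the random matrices standing in for the learned weights A_1 and A_2 are all illustrative assumptions; a real implementation would be a trained PyTorch module:

```python
import numpy as np

def se_layer(feat, w1, w2):
    """Squeeze-and-Excitation on an (H, W, C) feature map: squeeze by global
    average pooling, excite with two dense layers (ReLU then sigmoid), and
    rescale each channel of the input by its attention weight."""
    z = feat.mean(axis=(0, 1))                 # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)                # A_1 + ReLU: (C/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))        # A_2 + sigmoid: (C,), in (0, 1)
    return feat * s                            # channel-wise rescaling

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 64, 16
feat = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C // r, C)) * 0.1    # stands in for learned A_1
w2 = rng.standard_normal((C, C // r)) * 0.1    # stands in for learned A_2
out = se_layer(feat, w1, w2)
print(out.shape)                               # (8, 8, 64)
```

Because the two dense layers act on a C-dimensional vector rather than the full H × W × C map, the module adds only 2C²/r parameters per block, which is why it is described above as lightweight.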
Both the SE-Layer and the CBAM module are lightweight, meaning they do not add much computational cost. These two ideas inspire us to combine RetinaNet with attention modules. Hence, we design a RetinaNet-based network to perform the object detection task in video synopsis. The overall framework of our approach is shown in Fig. 5.
The framework has four main components. First, we separately add the two kinds of attention modules mentioned above to modify the residual modules in ResNet, which is responsible for computing a convolutional feature map over the input image. ResNet produces features at different scales (c_3, c_4, c_5, p_6, p_7). The first three feature maps (c_3, c_4, c_5) are passed through the FPN [56]; the resulting feature maps are combined with the last two maps generated by ResNet to build the feature pyramid (p_3, p_4, p_5, p_6, p_7). The reason for using the FPN is that multi-scale features can be used to detect objects of different scales, thus obtaining higher accuracy [40]. Afterwards, we use the box subnet and class subnet given in [40] to predict anchor boxes and object classes. The class subnet has four 3 × 3 convolution layers with ReLU activations, followed by a 3 × 3 convolution layer with (class × anchor) channels. The box subnet is similar, but its last convolution layer has (4 × anchor) channels instead. In Section IV, we compare the effects of the two attention modules on the performance of the above framework through experiments, giving the experimental process and the analysis of the results.

B. TRAJECTORY OPTIMIZATION FOR VIDEO SYNOPSIS
In the activity rearrangement stage, the balance between video compression and collisions of moving objects must be considered. Since multiple objects from different video frames are moved into the same frame, collisions between moving objects are inevitable, and these collisions lead to chaos in the scenes of the synopsis video. Usually, a clustering-based trajectory combination optimization method is used to solve this problem: an energy function is defined to represent conflict and loss in the spatial and temporal domains [5], and optimization algorithms such as simulated annealing are used to minimize it for the spatiotemporal placement of moving objects. Other algorithms, like the ''traffic light'' algorithm [57], let one object stop and wait for another to avoid a collision in the synopsis video, but this causes track interruptions that can make viewers unsure whether an object is moving or static. Table 1 gives an overview of existing research on collision handling for video synopsis; we group methods with similar ideas and compare the characteristics of each group.
Inspired by the Semi-Transparency-Handling Collision (STHC) method [49], we use a transparency strategy to handle collisions and propose the Local Transparency-Handling Collision (LTHC) algorithm. The researchers of [49] use semi-transparency to handle collisions, but when two moving objects collide, they suddenly change from fully visible to translucent. In particular, when a very small object or a misdetected ''object'' (which may actually be background) collides with a large real object, a large area of the moving object becomes transparent, which may affect the viewer's visual perception. To that end, we improve the equation of STHC as follows:

p_out|(x,y) = γ · p_in1|(x,y) + (1 − γ) · p_in2|(x,y),  if (x, y) lies in the overlapping region of in_1 and in_2;
p_out|(x,y) = p_in1|(x,y),  otherwise    (7)

where in_1 and in_2 are the two objects extracted by the detector, p_·|(x,y) is the pixel value at position (x, y), γ ∈ [0, 1] is the blending weight, and out is the result of our method. Algorithm 1 gives a detailed description of the LTHC method. In Fig. 6, we provide a schematic diagram of the different transparency strategies for handling collisions: (a) collision with no transparency, (b) the global transparency of [49], and (c) our LTHC algorithm. There is severe occlusion in (a), which makes it impossible to display the occluded object (the red car in the image). In (b) the occluded part can be shown, but the rest of the object is unclear due to the translucency. In (c) only the occluded part is made transparent, while the other parts keep their original pixel values, solving the collision occlusion without affecting the clarity of the rest of the object.
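The piecewise blending rule in (7) can be applied directly over object masks. The sketch below is illustrative only: it uses synthetic grayscale frames and rectangular masks (rather than detector output) and γ = 0.5, the value used in Section IV:

```python
import numpy as np

def lthc_blend(frame1, mask1, frame2, mask2, gamma=0.5):
    """Local Transparency-Handling Collision as in (7): blend the two objects
    only where their masks overlap; elsewhere keep the original pixels, so
    only the occluded part of an object becomes transparent."""
    out = frame1.astype(np.float64).copy()
    overlap = mask1 & mask2
    only2 = mask2 & ~mask1
    out[overlap] = gamma * frame1[overlap] + (1 - gamma) * frame2[overlap]
    out[only2] = frame2[only2]      # paste the second object where it stands alone
    return out

h = w = 10
f1 = np.full((h, w), 200.0); m1 = np.zeros((h, w), bool); m1[2:6, 2:6] = True
f2 = np.full((h, w), 100.0); m2 = np.zeros((h, w), bool); m2[4:8, 4:8] = True
out = lthc_blend(f1, m1, f2, m2)
print(out[5, 5], out[3, 3], out[7, 7])   # 150.0 (overlap), 200.0, 100.0
```

Only the pixels inside the intersection of the two masks are averaged; every other pixel of both objects keeps its original value, which is exactly the local-versus-global distinction between LTHC and STHC described above.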

Algorithm 1 Local Transparency-Handling Collision Algorithm
Input: Extracted object sets T = {T_1, T_2, . . . , T_n}.
Output: Synopsis video.
1: for i = 1 to n do
2:   for all p_in1 in T_i do
3:     for all p_in2 in the jth frame do
4:       Calculate p_out using (7);
5:       Paste p_out into the jth frame;
6:       Write the jth frame into the output video file;
7:       Set p_out ← p_in2 in the jth frame;
8:     end for
9:   end for
10: end for

IV. EXPERIMENTS
We apply the video synopsis optimization methods proposed in this paper. First, we present the experimental results of attention-RetinaNet. Second, we analyze the results of video synopsis with LTHC. Furthermore, we use attention-RetinaNet and LTHC together to realize a complete classified video synopsis system.

A. EXPERIMENTS OF ATTENTION-RetinaNet
In order to train our model and compare the performance of the object detectors, we conduct a series of comprehensive experiments on different datasets. The experiments run on a server with two Intel(R) Xeon(R) E5-2650 v3 2.3 GHz CPUs and an NVIDIA Tesla M60 GPU with 8 GB of memory. All of our experiments are implemented in Python with PyTorch.
We use a pretrained ResNet-50 as the backbone of RetinaNet and adopt the following parameters in the RetinaNet-based models: in (1), α = 0.25 and β = 2; in (5), the reduction ratio r = 16. During training, the batch size is 6, the total number of epochs is 100, and we use the SGD optimizer with a learning rate of 0.001, a momentum of 0.9, and L2 weight decay of 5 × 10^−4. We re-scale the images, resizing the short side to 600 pixels. In the NMS module, we set the threshold to 0.45.
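For reference, the NMS step with its 0.45 IoU threshold can be sketched as follows (an illustrative NumPy version with made-up boxes, not the module used in our pipeline):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    every remaining box whose IoU with it exceeds the threshold, repeat.
    Boxes are (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # keep only weakly overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first too much
```

With the 0.45 threshold, the second box (IoU ≈ 0.68 with the first) is suppressed as a duplicate detection, while the distant third box survives.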
In the first experiment, to compare the performance of four detection models (YOLO v3, RetinaNet, RetinaNet with CBAM, and RetinaNet with SE-Layer), we evaluated them on the public dataset PASCAL VOC 2007. The results are shown in Table 2. We calculated the mean average precision (mAP) and the average precision (AP) of each class to understand the impact of the attention modules. The attention module enhances the feature contrast of the CNN, and the improved features translate into higher object detection accuracy. Comparing the two attention modules, CBAM, which combines the spatial and channel attention mechanisms, improves the accuracy of object detection more than the SE-Layer does.
Further, we build a dataset of about 5,000 images for the subsequent video synopsis experiment. The data come from the video clips of a traffic road video dataset [60], which contains several videos from traffic surveillance cameras. Video frames are extracted from these videos and annotated for object detection. We tagged the marked objects with three classes: ''car'', ''truck'', and ''bus''. The data are divided into a training set and a validation set at a ratio of 7 : 3. The number of objects of each class in the training set is shown in Table 3. The objects also span different sizes, so objects of different sizes can be detected and classified. Table 4 shows the results of the different object detection networks on our dataset. Here we compare three models: RetinaNet, RetinaNet with CBAM, and RetinaNet with SE-Layer; the first row gives the result of the raw RetinaNet, the second that of RetinaNet with the CBAM module, and the last that of RetinaNet with the SE-Layer. We reach a conclusion similar to that of the first experiment. Table 5 shows the speed of each method when detecting on videos of size 1920 × 1080. Although the attention mechanism increases the computational complexity of the model, introducing the two modules does not reduce the speed much: even with the CBAM module added, the detection speed is still more than 3 fps, and the accuracy improves by 4.87%.
The visualization of the object detection is shown in Fig. 7. Column (a) shows the input images, (b) the detection results of raw RetinaNet, (c) the detection results of RetinaNet with the CBAM module, and (d) the detection results of RetinaNet with SE-Layer. We chose three typical cases: the first row contains incomplete objects in the frame, the second row many objects of different sizes, and the last row a large object in one frame. Through our experiments, we found that RetinaNet with the CBAM module has two advantages. First, for incomplete objects, this structure yields better detection results (see the first row of Fig. 7). Second, thanks to the attention mechanism, the network can ''notice'' and detect smaller targets (see the second row of Fig. 7).

B. EXPERIMENTS OF LTHC METHOD
For trajectory optimization, the LTHC algorithm can improve the compression ratio. With the LTHC method, two different objects can appear in the same position at the same time without blocking each other. We apply our method in (7) with γ = 0.5. Fig. 8 shows a collision between two objects: two cars collide in the synopsis video. The two vehicles come from different frames of the original video (i.e., they occur at different times), so the collision never happens in the original video but arises when the synopsis video is generated. Column (a) is the result of our LTHC method and column (b) is the result of STHC. Our method effectively solves the problem of unclear objects in the synopsis video.
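The transparency handling above can be sketched as alpha-blended compositing: where two object tubes overlap in a synopsis frame, both are kept visible by mixing them with weight γ instead of letting one occlude the other. This is a minimal grayscale sketch under our own assumptions (boolean object masks, a hypothetical `blend_collision` helper); the paper's full method in (7) operates inside the synopsis pipeline.

```python
import numpy as np

def blend_collision(background, obj_a, mask_a, obj_b, mask_b, gamma=0.5):
    """LTHC-style compositing: in overlap regions, alpha-blend the two
    objects with weight gamma so neither blocks the other."""
    frame = background.astype(np.float64).copy()
    only_a = mask_a & ~mask_b
    only_b = mask_b & ~mask_a
    overlap = mask_a & mask_b
    frame[only_a] = obj_a[only_a]
    frame[only_b] = obj_b[only_b]
    frame[overlap] = gamma * obj_a[overlap] + (1.0 - gamma) * obj_b[overlap]
    return frame

# Toy 4x4 grayscale example: the two masks overlap on row 1.
bg = np.zeros((4, 4))
car_a = np.full((4, 4), 100.0)
car_b = np.full((4, 4), 200.0)
mask_a = np.zeros((4, 4), bool); mask_a[:2, :] = True
mask_b = np.zeros((4, 4), bool); mask_b[1:3, :] = True
frame = blend_collision(bg, car_a, mask_a, car_b, mask_b, gamma=0.5)
# frame[1, 0] == 150.0: both cars remain visible in the overlap.
```

With hard stitching (STHC-like), the overlap row would take the value of only one object; the blend keeps both trajectories legible.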
Without the use of complex optimization algorithms, the compression ratio of videos is also greatly increased. We calculate the compression ratio (CR) by the following equation:

CR = t_original / t_output,

where t_original refers to the duration of the original video and t_output is the duration of the output synopsis video. The results are shown in Table 6. We also compare the compression ratio with the method in the study of [59]; the result is given in Table 7. It can be seen that our method achieves a higher compression ratio.
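The compression ratio is a direct quotient of durations; the values in the example below are illustrative, not taken from Table 6.

```python
def compression_ratio(t_original, t_output):
    """CR = t_original / t_output: how many times shorter the synopsis is."""
    return t_original / t_output

# e.g. condensing 494 s of footage into a 100 s synopsis gives CR = 4.94.
cr = compression_ratio(494.0, 100.0)
```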

C. IMPROVEMENT EFFECTS OF THE PROPOSED METHODS
Since CNN-based object detectors classify objects at the same time as detecting them, our method can generate class-specific video synopses. We compare the effects of the previous video synopsis method with ours in Fig. 9. Fig. 9(a-b) are the original video frames, (c) represents the previous method, and (d-f) show ours. We improve on three aspects. First, the RetinaNet-based object detection method keeps detected objects intact, whereas the synopsis video generated by the traditional object detection algorithm contains incomplete objects and mistakes lighting changes for moving objects, resulting in a low-quality synopsis video (shown in (c)); our method greatly reduces this problem. Second, the object labels allow us to generate video synopses of specific object types: (e) and (f) show the results when car or truck is specified. Third, the LTHC method renders objects more clearly in the synopsis video than the STHC method.
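Generating a class-specific synopsis reduces to filtering the detected object tubes by their detector label before rearrangement. The tube representation below (a dict with `id` and `label`) is a hypothetical sketch of ours, not the paper's data structure.

```python
def filter_tubes(tubes, wanted_class):
    """Keep only object tubes whose detector label matches the requested
    class, so the synopsis contains e.g. only cars or only trucks."""
    return [t for t in tubes if t["label"] == wanted_class]

# Toy tube list as produced by a classifying detector.
tubes = [
    {"id": 0, "label": "car"},
    {"id": 1, "label": "truck"},
    {"id": 2, "label": "car"},
]
cars = filter_tubes(tubes, "car")
```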

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed two methods to obtain a better synopsis video: the attention-RetinaNet object detection method and the LTHC method. These two methods improve the quality of video synopsis in object detection and activity rearrangement, respectively. A CNN-based one-stage object detection algorithm is applied to the field of video synopsis, and the STHC algorithm is improved into the LTHC algorithm, which alleviates the unclear-object problem while preserving the high compression ratio of the original algorithm.
In the future, we will analyze how to increase the speed of the CNN-based detector to accelerate the video synopsis task. Parallel computing and big data technology can also be applied to reduce the time spent on video synopsis and the storage cost in the computing system.

XIANRUI LIU is currently pursuing the master's degree with the School of Computer Engineering and Science, Shanghai University (SHU). His current research interests include video synopsis, video understanding, big data, and cloud computing.
YIYONG HUANG is currently pursuing the master's degree with the School of Computer Engineering and Science, Shanghai University (SHU). His current research interests include action recognition, video captioning, and video synopsis.
CONGCONG ZHOU is currently pursuing the master's degree with the School of Computer Engineering and Science, Shanghai University (SHU). His research interests include image processing and deep learning.
HUAIKOU MIAO (Member, IEEE) is currently a Professor of computer engineering and science with Shanghai University (SHU), China. His research interests include formal methods and software engineering.