Movie Tags Prediction and Segmentation Using Deep Learning

The sheer volume of movies generated these days requires automated analytics for efficient classification, query-based search, and extraction of desired information. These tasks can only be efficiently performed by a machine learning based algorithm. We address this issue in this paper by proposing a deep learning based technique for predicting the relevant tags for a movie and segmenting the movie with respect to the predicted tags. We construct a tag vocabulary and create the corresponding dataset in order to train a deep learning model. Subsequently, we propose an efficient shot detection algorithm to find the key frames in the movie. The extracted key frames are analyzed by the deep learning model to predict the top three tags for each frame. The tags are then assigned weighted scores and are filtered to generate a compact set of most relevant tags. This process also generates a corpus which is further used to segment a movie based on a selected tag. We present a rigorous analysis of the segmentation quality with respect to the number of tags selected for the segmentation. Our detailed experiments demonstrate that the proposed technique is efficacious not only in predicting the most relevant tags for a movie, but also in segmenting the movie with respect to the selected tags with high accuracy.


I. INTRODUCTION
The huge amount of multimedia data generated these days makes it challenging to devise techniques which can automatically check the contents of multimedia data to ascertain their authenticity and classify them accordingly. In particular, retrieval of required information from multimedia data and assignment of appropriate tags largely depend on manual processing. Hence, the quality of the assigned tags follows a subjective criterion and varies from person to person. Our preliminary experiments [1] in this regard demonstrate that human-generated meta data does not suffice to give full insight into the main contents of a movie and/or is inconsistent owing to the imprecision of human information recall. In addition, manually-generated semantic tags are less accurate and present irregularities. Our preliminary experiments on this topic further reveal that this ostensibly trivial task entails an intelligent analysis of a video to predict its representative tags without human intervention. This automatically extracted information has immense applications in optimizing video search, automatically retrieving scenes from videos based on a user's query, object detection and localization, automatic text/subtitle generation for videos, detecting specific events in videos, action recognition, behavior recognition, recommendation systems, etc. Among these applications, scene-driven retrieval is particularly important in the sense that it helps not only in content-censorship (e.g., automatically censoring the scenes containing nudity, sex, violence, smoking, etc.), but also in on-demand retrieval of desired scenes from a given movie (e.g., making highlights of a soccer match which contain all the goal events). At the same time, scene-driven retrieval is equally important for video summarization, e.g., removing all boring or unwanted scenes.
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou.
Segmenting a movie into its major constituent topics requires a precise identification of these topics in the first place. This information can then be used for on-demand scene selection, content-censorship and other tasks. In this paper, we address the problem of predicting key information from a movie in the form of a small number of tags which describe the overall contents of the movie. For this purpose, we aim not only to understand the semantic meaning of individual movie frames, but also to predict a compact set of the movie's representative topics. The predicted information can be utilized in a number of ways such as movie categorization, context-based search, and content-censorship (e.g., nudity, violence and sex in kids' movies). Apart from this, the predicted tags for a movie can be further utilized to segment the movie according to the user's choice.
Applying the traditional approaches of object detection to the individual movie frames would be inefficient, as we would end up with low-level information (e.g., the localization of objects in the individual frames instead of the contextual relationship among them). In addition, processing each movie frame would also introduce computational inefficiency and result in redundant information. This problem can only be addressed by a machine learning algorithm which can be trained to predict the high-level, representative features of movie scenes. The tremendous advancements in the field of machine learning have paved the way for finding patterns in complex data with an accuracy which, in some cases, even surpasses human pattern-matching performance. Cheap and scalable parallel processing on Graphics Processing Units (GPUs) has made it possible to efficiently apply machine learning techniques to image/video analytics [2]. That said, applying machine learning to learn the traditional image features for our problem is not efficient due to the well-known issues related to these features, such as the requirement of mathematical modeling, limited generalization, scale and rotation variance, inability to maintain performance under different conditions, etc.
Instead of learning the hand-crafted image features, it is more efficient to discover the underlying features in the individual movie frames. Deep learning [3] can serve this purpose, as it does not require a priori information of image features. Instead, it learns the underlying patterns in complex data during training. Apart from this, a deep learning model trained on a large dataset can be retrained using transfer learning [4] for a different classification task with a much smaller dataset and training time. Considering the promising features of deep learning, we formulate the problem of movie tags prediction as a deep learning based classification. For this purpose, we first develop a tag vocabulary in which each tag represents a class. We further develop a dataset corresponding to each tag in the vocabulary. Subsequently, we transfer the features of a pre-trained Convolutional Neural Network (CNN), Inception-V3 [5], to our training task by modifying and re-training its final layer using transfer learning. We further propose an efficient shot detection technique for determining the key frames in a movie which are later used for analytics by the deep learning model. The proposed shot detection technique is able to detect the hard-cut, fade-in and fade-out shot boundaries.
Once all the key frames in a movie are found, their CNN features are computed and fed to the newly added final layer of the trained model to get the corresponding predictions for each frame. The predictions corresponding to each frame are assigned certain scores and relative weights based on the values of their prediction probabilities and predominance in the movie. The tags having smaller relative weights are dropped out to obtain a compact set of few prime tags which best represents the overall contents of the movie. Discarding the motion information by processing only the key frames of a movie does not have a considerable impact on the prediction accuracy, as the recent related work in this domain reveals that motion features do not drastically impinge on this task [6]- [8].
The key frame analytics by the deep learning model further generates a corpus which is used to segment a movie with respect to a selected tag. The corpus contains the details of each shot's boundary, its key frame, the predicted tags for the key frame, and the spatio-temporal details of each shot. We further analyze the segmentation performance with respect to the number of predictions per key frame selected for the segmentation. The results presented in the paper show the tradeoff between the precision and recall of the segmentation.
The conspicuous features of the work presented in this paper can be summarized as follows.
• The proposed technique of movie tags prediction operates at a higher semantic level by seizing the overall context in the extracted key frames of a movie. The context denotes the semantic meaning of the inter-objects relationship in a scene, for instance, violence, action, romance, fight, etc.
• This work stands apart from the traditional event/scene recognition approaches where each item is adherent to a single event or scene.
• Our proposed technique also contrasts with the traditional object recognition techniques, which aim to localize and label every individual object in an image. Doing so would yield highly redundant and inconsequential information for our task.
• This work does hold resemblance with genre classification of movies. However, a movie typically has 2-3 genres which do not encompass the entire range of a movie's contents (e.g., nudity, sex, smoking, violence, etc). In contrast, our carefully designed, flexible and scalable tag vocabulary sufficiently covers the main theme of a movie.
• Unlike the existing video segmentation techniques which perform little to no semantic analysis and mostly exploit the visual similarity of the shots to merge them into non-overlapping scenes, we do consider a shot's semantics to label and categorize it accordingly.
The rest of the paper is organized as follows. Section II provides an overview of the domain of movie/video tags prediction and segmentation. Section III provides a brief theoretical background of convolutional neural networks and transfer learning. Section IV describes our movie tags prediction and segmentation algorithms in detail. In Section V, we discuss the experimental setup and evaluation results of movie tags prediction and segmentation. Section VI concludes the paper.

II. RELATED WORK
We believe that movie tags prediction and segmentation has not been well-studied in the literature. The related work in this domain is primarily targeted at video tagging on a limited scale. Qi et al. [9] annotated certain concepts in a video using multi-label classification and the inter-class correlation. Another video labeling approach proposed by Siersdorfer et al. [10] put to use the redundancy among YouTube videos for finding associations among videos and assigning tags to similar videos. The techniques proposed by Shen et al. [11] and Liu et al. [12] made use of the data captured by the smartphone sensors to generate video tags. In Miranda-Steiner [13], the proposed technique identified basic objects in the images and videos of a digital camera and further exploited the geographical and date/time information to predict the relevant tags.
Ulges et al. [14] first found the key frames from a video and then predicted several visual features for each key frame. The visual features were assigned scores which were later fused to generate a final probability for a certain tag in the video. The tagging performance in this approach largely depended on the feature modalities and thus had limited accuracy. Chen et al. [15] proposed a video tagging technique which first found all the textual descriptions of a video from Internet sources and then applied a graph model to the descriptions to discover and score the key words serving as tags. This technique was dependent on the human-generated textual description. In a similar technique, Zhao et al. [16] first found similar videos by local features. The tags from the similar videos were analyzed to pick the most relevant tags for the given video. Evidently, this technique shared the same limitations as [15]. Some techniques proposed by Aradhye et al. [17], Toderici et al. [18] and Yang and Toderici [19] did not rely solely on the user-supplied tags, but also took into account the audiovisual features to train classifiers based on the correspondence between the contents and the user-annotated tags. Nevertheless, the incorporation of inconsistent user-supplied meta data introduced the aforementioned issues.
A large part of the relevant literature in this context relied on the user-annotated meta data. A similar technique proposed by Chu et al. [20] first searched for the images on Flickr that have similar tags to those of the given video. A bipartite graph was used to describe the relationship between the key frames of the video and the tags associated with the images. The technique proposed by Acharya [21] selected one or more user-generated tags for a video which described its category. A transcript comprising a plurality of words was generated along with their respective ranks. Based on the ranking of these words, one or more tags were generated. Chen et al. [22] proposed a web video topic detection technique. This technique utilized the video related tag information to determine bursty tag groups based on their co-occurrence and temporal trajectories. The near-duplicate key frames predicted from the web videos were fused with these tag groups. Subsequently, the fused groups were further matched with the keywords obtained from the search engine to find the topics.
Some techniques [23], [24] used the plot synopses and summaries of movies to predict a set of tags or movie genres. These techniques required plot synopses of movies which are not always readily available with a movie. In addition, the dataset was comprised of manually curated tags which share the same aforementioned limitations.
Ullah et al. worked on video semantic segmentation for pedestrian flow and crowd behavior. In [25], they identified crowd behaviors from a video sequence using a method based on thermal diffusion process and social force model. In [26], they employed a recurrent conditional random field using Gaussian kernel features to segment anomalous entities in pedestrian flows by detection and localization. In a recent work [27], they proposed the hybrid social influence model for pedestrian motion segmentation by using a particle representation and modelling the influence of particles on each other. However, these methods are domain-specific and focused on segmentation inside the images of the frames.
Due to the recent breakthroughs and advancements in deep learning, the research on semantic analysis of images and videos has been diverted to use complex neural architectures to learn hierarchical feature representations. The hot research areas in this domain include converting visual data to textual representation [28]- [31], answering questions from videos [32], [33], and video classification [7], [34]- [36]. The first two areas are different from tags prediction, as they entail more sophisticated architectures such as recurrent neural network [28] in combination with Convolutional Neural Network (CNN) to discover the spatio-temporal connection between consecutive video frames. On the other hand, video classification does hold resemblance with video tagging, but it is mainly focused on predicting the major category a video falls in, rather than predicting an extended set of classes pertaining to a given video.
A number of video segmentation techniques have been studied in the relevant literature. The majority of these techniques use a common approach: finding the shot boundaries and merging the shots into uncategorized segments (scenes) based on their visual similarity. A shot is an elementary structural segment that is defined as a sequence of images taken without interruption by a single camera [37]. Rasheed and Shah [38] clustered the shots based on their color similarity and found the segment boundaries based on the shot lengths and the motion contents. Some techniques [39]-[41] addressed video segmentation by constructing a shot similarity graph using the color and motion information and subsequently segmenting the video by graph partitioning. Some shot clustering techniques [42], [43] also used the Markov chain Monte Carlo technique for detecting segment boundaries, although the segments were uncategorized. In another shot clustering approach, Chasanis et al. [44] applied a sequence alignment algorithm to detect the change in pattern of shot labels to determine segment boundaries. In another technique, Chasanis et al. [45] first found the local invariant descriptors of the key frames of all the shots and grouped them into clusters. Each cluster was treated as a visual word. The histograms of visual words were smoothed using Gaussian kernels whose local maxima represent the segment boundaries. In a different approach, Hoai et al. [46] augmented video segmentation with action recognition. They first trained a recognition model using a multi-class SVM on a labeled dataset. The segmentation and action recognition were performed simultaneously using dynamic programming. This was the first approach to video segmentation with segment categorization, though only for a limited number of recognized actions. However, it did require engineered image features for training the action recognition model.
VOLUME 8, 2020
Some approaches combined audiovisual features for scene segmentation. Sidiropoulos et al. [47] addressed this issue with a semantic criterion by exploiting the audiovisual features of the key frames to construct multiple Scene Transition Graphs (STGs) [48]. A probabilistic merging process combined the results of the STGs to detect segment boundaries. In a similar approach, Bredin [49] extended this idea by combining speaker diarization and speech recognition with visual information. A drawback of these techniques is that the STGs exploit low-level visual features and provide no margin for augmenting heterogeneous feature sets. In addition, the heuristic settings of certain STG parameters are also required.
In another technique, Baber et al. [50] used frame entropy to find shot boundaries and determine the key frames of the shots. Afterwards, the SURF features of the neighboring key frames within a window were matched to determine the scene boundary. In a later approach, Baber et al. [51] computed the histogram of visual words for each shot. The distance between the visual word histograms was then calculated to merge the shots which are closer in space.
In a more recent approach, Yanai et al. [52] found the relevant shots from web videos based on the given keywords. This technique first searched for the relevant web videos by matching their human-generated tags with the given keywords. It then segmented the selected videos into shots and ranked them according to the similarity of visual features. The top-ranked shots represented the shots of interest.
A detailed literature review in this domain reveals that video tagging and segmentation have not been studied in combination. Whereas the existing techniques of video tagging either depend on hand-crafted image features and user-annotated meta data or do not provide an extended set of the thematic points of a movie, semantic criteria are largely ignored in video segmentation. The commonly used approach of matching the low-level visual and/or audio features of the successive shots (or their key frames) to determine the segment boundaries is too simplistic to capture the semantic correlation among the shots. Additionally, segmenting and merging all the logical story units based on the semantic understanding of individual shots cannot be efficiently done by low-level, engineered audiovisual features. Hence, segmenting a video into constituent topics, which can be later retrieved by a query, requires an intelligent semantic analysis of each shot. This is only efficiently possible by a deep learning based algorithm which does not require a priori knowledge of the low-level features.
We address this issue with a threefold approach: (i) we first propose an efficient shot boundary detection algorithm which finds the representative key frames of all the shots in a movie, (ii) we train a convolutional neural network on a tag vocabulary to predict the context of each key frame and subsequently generate a compact set of the movie tags without requiring a priori information of image features or user-annotated meta data, and (iii) we offer an on-demand segmentation of the movie based on its predicted set of tags. Using the semantic information provided by the movie tags, we eliminate the need for matching the low-level audiovisual features of the successive shots for segmentation. Our segmentation approach classifies a shot into a particular category based on its contents. Hence, all the relevant neighboring shots can be efficiently merged into a particular category. In this way, our movie segmentation approach is the first to use the semantic criterion for segmentation.

III. CONVOLUTIONAL NEURAL NETWORK (CNN) & TRANSFER LEARNING
Contrary to traditional neural networks, CNNs have a much larger number of hidden layers, which are well-suited to discover the intricate patterns in complex data without a given mathematical model. Due to this appealing feature, the last decade has demonstrated the tremendous potential of CNNs for semantic analysis of images and videos. A CNN has four types of layers: (i) the convolution layer extracts features from a given image using multiple filters, (ii) the activation layer restricts the output of the convolution layer to a specified range and introduces nonlinear mapping and generalization in the learning process, (iii) the pooling layer reduces the spatial size of the data, which results in fewer parameters and computations, and (iv) the classification layer outputs a probability distribution which contains the final score of each class. A CNN architecture may range from simple (having a relatively small number of convolution, activation and pooling layers) to complex (having hundreds of layers). The deep architectures help a CNN to discover patterns/features in complex data without a given mathematical model.
The deep architecture also makes training a CNN from scratch challenging, as it requires enormous computing power, a very long training time, and a huge amount of training data. However, analogous to human learning, the knowledge acquired by a CNN pertaining to a specific problem is transferable to another problem [53]. As we move from the lower level layers of a CNN to the higher level layers (in the direction of the classification layer), the specificity of features increases until the final classification layer becomes entirely task specific. The image features extracted by the lower level CNN layers can be utilized to re-train the model for an entirely different task, eliminating the need for training the model from scratch. In this connection, all the layers of a pre-trained CNN model, except the final classification layer, can be used as a fixed feature extractor. The final layer can be modified and re-trained for a new task, utilizing the knowledge obtained from the previous training. This method is called transfer learning, which we use to re-train a CNN model, Inception-V3, trained on a large dataset (ImageNet¹). Although this CNN model has been trained for a completely different task, its features are effectively transferred to the task of movie tags prediction.
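The idea of a fixed feature extractor with a re-trained final layer can be illustrated with a minimal, dependency-free sketch. Everything named here is hypothetical: `frozen_features` merely stands in for the frozen pre-trained layers (it is not Inception-V3), and the four toy "images", the learning rate and the step count are illustrative assumptions. Only the weights and biases of the new Softmax layer are updated.

```python
import math

def frozen_features(image):
    """Hypothetical stand-in for the frozen pre-trained layers (never updated)."""
    return [sum(image) / len(image), max(image)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def scores(W, b, x):
    """Affine scores of the new final layer for feature vector x."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(b))]

def train_head(features, labels, n_classes, lr=0.5, steps=300):
    """Fit only the new final layer by full-batch gradient descent on cross-entropy."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(features, labels):
            p = softmax(scores(W, b, x))
            for c in range(n_classes):
                err = p[c] - (1.0 if c == y else 0.0)  # d(loss)/d(score_c)
                gb[c] += err / n
                for j in range(dim):
                    gW[c][j] += err * x[j] / n
        for c in range(n_classes):
            b[c] -= lr * gb[c]
            for j in range(dim):
                W[c][j] -= lr * gW[c][j]
    return W, b

def predict(W, b, x):
    s = scores(W, b, x)
    return s.index(max(s))

# Toy data: two "dark" and two "bright" images, flattened to pixel lists.
images = [[0.1, 0.2, 0.1, 0.2], [0.0, 0.1, 0.2, 0.1],
          [0.9, 1.0, 0.8, 0.9], [0.8, 0.9, 1.0, 0.9]]
labels = [0, 0, 1, 1]
feats = [frozen_features(img) for img in images]  # computed once, then reused
W, b = train_head(feats, labels, n_classes=2)
predictions = [predict(W, b, f) for f in feats]
```

Because the extractor is never updated, its features can be computed once and cached, which is what makes re-training only the final layer cheap compared with training the whole network.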

IV. MOVIE TAGS PREDICTION AND SEGMENTATION
To the best of our knowledge, there is no public dataset of movies containing labelled static scenes related to the training classes of our tag vocabulary. Hence, we first develop a tag vocabulary comprising 50 movie tags. Subsequently, we construct a dataset for each tag by collecting the relevant features (movie frames describing the tag) from a number of movies. Table 1 shows our tag vocabulary, which has 700 images pertaining to each tag. It is worth mentioning that some of the tags have overlapping features (e.g., violence, car chase, action, sword fight, etc.), which makes this training problem tougher than one in which classes share little to no features. Our tag vocabulary is scalable and evolves as we identify more relevant tags and collect the appropriate dataset. In order to construct the training dataset, we first formulate a criterion for finding the relevant training images pertaining to each tag. These images are collected as static frames from a number of movies. Table 1 also describes our semantic criteria of data collection for each tag, which represent the required contents in an image describing a tag. From Table 1, it is also evident that we adequately cover the semantic contents pertaining to each tag by including as many of its variants as possible.

A. TRAINING
The process of feature prediction with the pre-trained deep learning model and transfer learning with Softmax classification is depicted in Figure 1. We retrain the pre-trained Inception-V3 model on our tag vocabulary using transfer learning. After modifying the final classification layer of the Inception-V3 model for movie tags prediction, we use the rest of the layers as a fixed feature extractor. A dropout layer [54] is further added as a penultimate layer to randomly discard the activations of 50% of the neurons during training to prevent inter-neuron dependencies and lack of generalization. A small learning rate of 0.005 and large training and validation batches (500 images each) are used to obtain more stable results. The dataset is partitioned such that 80% of the images are used for training, 10% for validation, and 10% for testing. For each input, the output of the penultimate layer, after applying dropout, is calculated as follows, where $W_{i,j} \in \mathbb{R}$ is the weight coefficient associated with the $j$-th and $i$-th neurons, and $b_i$ represents the bias for the $i$-th neuron.
¹ http://www.image-net.org/
$x_j$ represents the $j$-th activation of the feature map from the previous (convolutional) layer. Specifically, if this were the first hidden layer, $x_j$ would be the $j$-th pixel of the input image. The Rectified Linear Unit (ReLU) activation function is used to restrict the output to non-negative values. It is linear (identity) for all positive values, and zero for all negative values.
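One way to write the two relations described above is sketched below. The symbols $z_i$ (pre-activation) and $a_i$ (activated output) are notational assumptions introduced here; only $W_{i,j}$, $x_j$, $b_i$ and ReLU are named in the text. The dropout mask applied during training is omitted for brevity.

```latex
z_i = \sum_{j} W_{i,j}\, x_j + b_i ,
\qquad
a_i = \mathrm{ReLU}(z_i) = \max(0,\, z_i)
```

Under this form, $a_i = z_i$ whenever $z_i > 0$ and $a_i = 0$ otherwise, which matches the description of ReLU as the identity for positive values and zero for negative values.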
The main purpose of an activation function is to introduce non-linearity and generalization in the training. Without an activation function, the CNN would be limited in its capacity to learn complex patterns and would behave akin to a linear regression model. The reasons for selecting the ReLU activation function include its computational efficiency, smaller training time, faster convergence, and sparse activation. The output of the penultimate layer is converted into a probability distribution. For this purpose, Softmax classification [2] is used to calculate the tag probabilities by the following rule, where $p_i$ is the (normalized) probability assigned to the $i$-th tag in the tag vocabulary of 50 tags. A cross-entropy error estimate [2] $E(p, q)$ is used to calculate the difference between the predicted distribution $p$ and the actual distribution $q$ by the following rule.
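As a hedged illustration of these two rules, the sketch below uses the standard textbook forms $p_i = e^{s_i} / \sum_k e^{s_k}$ and $E(p, q) = -\sum_i q_i \log p_i$. The score symbol $s_i$, the three-tag toy vocabulary and the numeric values are assumptions made for the example; the paper's own notation may differ.

```python
import math

def softmax(scores):
    """Convert raw layer outputs into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """E(p, q) = -sum_i q_i * log(p_i); p is predicted, q is the true distribution."""
    return -sum(qi * math.log(max(pi, eps)) for pi, qi in zip(p, q))

# Toy example with 3 tags instead of 50; the scores are made up.
scores = [2.0, 1.0, 0.1]
p = softmax(scores)
q = [1.0, 0.0, 0.0]            # one-hot label: the first tag is the true one
loss = cross_entropy(p, q)

# A more confident correct prediction gives a smaller error estimate.
better_loss = cross_entropy(softmax([5.0, 1.0, 0.1]), q)
```

With a one-hot label the error reduces to $-\log p_i$ for the true tag, so it approaches zero as the probability assigned to the correct tag approaches one.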
The cross-entropy $E(p, q)$ compares the model's prediction $p(x)$ with the label, which is the true probability distribution $q(x)$. The cross-entropy decreases as the prediction gets more and more accurate. The Softmax classifier aims to reduce the error estimate between the predicted and the true distribution. We train the model for 50,000 iterations (500 epochs). The description of the CNN training parameters is given in Table 2. The smoothed graphs of the training/validation accuracy and cross-entropy estimate are shown in Figure 2a and Figure 2b. It is apparent that the training-validation gap in both cases is significantly reduced. It is also evident that the addition of the dropout layer and the right selection of training parameters lead to a good generalization. The overall test accuracy of the model is 85%.

B. TESTING
The trained model is tested on the static frames of different movies for tags prediction. The overlap among the tag features further allows us to consider more than one prediction in the probability distribution. While the tag having the highest probability represents the most dominant content in a movie frame, the other tags lower in the probability distribution may also reveal important information. The results of tags prediction for some movie frames are shown in Table 3. The tags appear in the order of decreasing probability with the first tag representing the highest probability. Although the other tags have smaller probabilities in the probability distribution, they still reveal the relevant information contained in the movie frames.

C. SHOT BOUNDARY DETECTION AND KEY FRAMES EXTRACTION
For semantic analysis of a movie, it is inefficient to analyze all the frames in the movie. Instead, we first find the representative frames of all the shots in the movie. We find the shot boundaries and select the middle frame of each shot as the key frame. In order to find the shot boundaries, we compute the intersection of the HSV (Hue, Saturation, Value) histograms of the successive frames. This gives us a measure of similarity of the two discretized probability distributions (HSV histograms) with possible value of the intersection lying between 0 (no overlap) and 1 (identical distributions). The advantage of using HSV color space is that it is not only more robust to light variations, but is also better with respect to human perception [55].
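The similarity measure described above can be sketched as follows. This is a minimal illustration that takes histograms as plain lists of bin counts; how the HSV histogram of a frame is computed (bin layout, library used) is left out and would be an additional assumption.

```python
def normalize(hist):
    """Scale bin counts so that they sum to 1 (a discrete probability distribution)."""
    total = float(sum(hist))
    return [h / total for h in hist] if total > 0 else [0.0] * len(hist)

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two normalized histograms; lies in [0, 1]."""
    a, b = normalize(h1), normalize(h2)
    return sum(min(x, y) for x, y in zip(a, b))

# Toy 4-bin histograms standing in for the HSV histograms of successive frames.
same = histogram_intersection([4, 3, 2, 1], [4, 3, 2, 1])      # identical frames
half = histogram_intersection([5, 5, 0, 0], [0, 5, 5, 0])      # partial overlap
none = histogram_intersection([10, 0, 0, 0], [0, 0, 0, 10])    # no overlap
```

Values near 1 indicate that two successive frames most likely belong to the same shot, while a sharp drop suggests a candidate shot boundary.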
Our shot boundary detection algorithm can detect two major types of shot boundaries: (i) hard-cut, which represents an abrupt transition from one shot to another, and (ii) dissolve, which is a gradual transition from one shot to another. We use a sliding window, centered on the current frame, over the similarity values of $n$ frames. In order to determine hard-cut and dissolve shot boundaries, we use two adaptive thresholds which are based on the statistical properties of the sliding window. The adaptive thresholds perform better than a single threshold, which cannot compensate for all the variations of the shot.
We evaluate the degree of similarity $S(i, i+1)$ between the current frame $i$ and the next frame $(i+1)$ of the sliding window for the hard-cut boundary by first computing the intersection of the HSV histograms of $i$ and $(i+1)$. Next, the minimum $m_1$, second minimum $m_2$, and mean $\mu$ of the similarity values within the window are calculated. A hard-cut is detected between frames $i$ and $(i+1)$ if the following three conditions are satisfied,
$$S(i, i+1) = m_1, \qquad S(i, i+1) \le \alpha\, m_2, \qquad \mu \ge \alpha\, m_1,$$
where $\alpha \in (1, 2)$. If the above conditions are not satisfied, we check for the dissolve boundary by the following rule,
$$S(i, i+1) \le \mu(1 - \sigma) + \sigma^2,$$
where $\sigma$ is the standard deviation of the similarity values within the window. It is worth mentioning that, in comparison with the sliding window used to evaluate the hard-cut boundary, a window that remains fixed on the left side but grows on the right side by one frame after each evaluation performs better for detecting a dissolve boundary. The algorithm of shot boundary and key frame detection is depicted in Algorithm 1, where mid denotes the similarity value at the centre of the window.
Algorithm 1 Shot boundary and key frame detection
    ...
    if mid = m_1 and mid ≤ α m_2 and µ ≥ α m_1 then
        shot_type = 'hard-cut'
    else if mid ≤ µ(1 − σ) + σ² then
        shot_type = 'dissolve'
    else
        shot_type = 'none'
    end if
    Get the shot boundaries and key frame
    Move the window to the next element in hist[ ]
end for
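The boundary test can be sketched in a few lines. The conditions below are transcribed from the surviving lines of Algorithm 1; the window length of 5, the value $\alpha = 1.5$ and the use of the population standard deviation are assumptions made for illustration, and the growing window suggested for dissolve detection is not modeled.

```python
import statistics

def classify_window(window, alpha=1.5):
    """Classify the centre of a window of frame-to-frame similarity values."""
    mid = window[len(window) // 2]
    ordered = sorted(window)
    m1, m2 = ordered[0], ordered[1]          # minimum and second minimum
    mu = statistics.mean(window)
    sigma = statistics.pstdev(window)        # assumed: population standard deviation
    # As transcribed, the second test is implied by the first whenever the
    # similarities are non-negative and alpha > 1; it is kept to mirror the listing.
    if mid == m1 and mid <= alpha * m2 and mu >= alpha * m1:
        return "hard-cut"
    if mid <= mu * (1 - sigma) + sigma ** 2:
        return "dissolve"
    return "none"

# Toy similarity windows (values are made up).
abrupt = classify_window([0.95, 0.96, 0.20, 0.94, 0.95])   # sharp dip at the centre
gradual = classify_window([0.90, 0.60, 0.50, 0.45, 0.90])  # slow decline around the centre
steady = classify_window([0.95, 0.96, 0.95, 0.94, 0.95])   # no boundary
```

Because both thresholds are computed from the statistics of the current window, they adapt to the local level of similarity instead of relying on a single global cut-off.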

D. TAGS PREDICTION
After finding a shot boundary and picking a key frame from the shot, we check if the key frame contains a reasonable amount of information. For this purpose, we convert the key frame to luminance/chrominance color space and calculate the entropy of each channel by the following rule [56],
$$H = -\sum_{i} p(x_i) \log p(x_i),$$
where $p(x_i)$ is the probability of a pixel $x_i$ to have a certain value. Only those key frames are selected for semantic analysis whose cumulative entropy is greater than a certain threshold ($H > 0.20$). After finding the key frames of a movie, we feed each key frame to the trained model to obtain the top 3 predictions. Subsequently, we determine the weight $W_i$ of each tag by the following rule, where $W_i$ denotes the weight of the $i$-th tag, $n_i$ is the number of occurrences of the $i$-th tag, $N$ is the total number of predicted tags, and $P_{ij}$ is the probability of the $j$-th occurrence of the $i$-th tag. The tag weights are further normalized to the range [0, 1] to calculate the relative strength $R_i$ of each tag by the following rule,
$$R_i = \frac{W_i - W_{min}}{W_{max} - W_{min}} \qquad (8)$$
where $W_{max}$ and $W_{min}$ represent the maximum and minimum tag strengths in the set of all the predicted tags. The tags having relative strengths less than a certain threshold are dropped to obtain fewer key tags which best describe the movie. Figure 3 depicts the overall approach.
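The filtering pipeline can be illustrated with the sketch below. Two points are assumptions rather than the paper's definitions: the entropy uses a base-2 logarithm, and the weight is taken as $W_i = \frac{1}{N}\sum_{j=1}^{n_i} P_{ij}$, which is only one plausible way of combining the quantities named in the text. The min-max normalization follows the description of $R_i$, and the threshold of 0.1 and the toy predictions are made up.

```python
import math
from collections import defaultdict

def channel_entropy(pixels, levels=256):
    """Shannon entropy (base 2, an assumption) of one channel given as a flat list of ints."""
    counts = [0] * levels
    for v in pixels:
        counts[v] += 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def tag_weights(predictions):
    """predictions: (tag, probability) pairs pooled over all key frames.

    ASSUMED form: W_i = (sum of the tag's probabilities) / N.
    """
    sums = defaultdict(float)
    for tag, prob in predictions:
        sums[tag] += prob
    total = len(predictions)                 # N: total number of predicted tags
    return {tag: s / total for tag, s in sums.items()}

def relative_strengths(weights):
    """Min-max normalization of the weights to [0, 1]."""
    w_min, w_max = min(weights.values()), max(weights.values())
    span = w_max - w_min
    if span == 0:
        return {tag: 1.0 for tag in weights}
    return {tag: (w - w_min) / span for tag, w in weights.items()}

def key_tags(predictions, threshold):
    """Drop tags whose relative strength is below the threshold."""
    strengths = relative_strengths(tag_weights(predictions))
    return sorted(tag for tag, r in strengths.items() if r >= threshold)

# Top-3 predictions for three key frames, flattened into one list.
predictions = [("music", 0.80), ("dance", 0.10), ("club/bar", 0.05),
               ("music", 0.70), ("dance", 0.20), ("family", 0.05),
               ("music", 0.90), ("club/bar", 0.05), ("dance", 0.03)]
strengths = relative_strengths(tag_weights(predictions))
selected = key_tags(predictions, threshold=0.1)
```

In this toy run the frequently and confidently predicted tags survive, while tags that appear rarely or with low probability are dropped.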

E. MOVIE SEGMENTATION
The shot boundary detection and analytics phase produces a corpus which contains the following information: (i) the key frame of a shot, (ii) the start and end frame of a shot, and (iii) the top three predicted tags for each shot or its representative key frame. This information is used to merge the related shots based on a user-selected tag from the set of the predicted tags, as shown in Table 4. We select the extended trailers and long clips of several movies and first predict the set of key tags for each video. Using the movie analytics corpus generated during the process of tags prediction, we can segment a movie with respect to a selected tag based on the top 1, top 2 or top 3 predictions for each key frame. We can also segment a movie based on more than one tag simultaneously (e.g., sex + nudity + romance, action + technology + sports, etc.).
For example, Table 4 shows the analytic results of a movie whose tag set is t = {music, glamour/fashion, technology, family, college/university, club/bar, dance}. After obtaining the tag set, we can segment the movie with respect to any tag. For instance, to extract only the music scenes, the segmentation algorithm first finds all the shots related to music in the shot description table (Table 4). A shot is described by its boundaries, i.e., its start and end frame. After finding all the shots corresponding to the music tag, the shots are merged into a separate movie file which represents the segmentation of the movie with respect to the music tag.
The accuracy of the extracted segment corresponding to a user-selected tag depends on the number of top predictions (1, 2 or 3) used to merge the related shots into a segment. For example, the music tag can be the top 1, top 2 or top 3 prediction for a shot. If we use only the top 1 prediction for the segmentation of the music tag, we select only those shots in which music is the top 1 prediction (a relatively smaller segment with higher precision). On the other hand, if we use the top 3 predictions, we select all those shots in which music is the top 1, top 2 or top 3 prediction (a relatively larger segment with lower precision).
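The shot selection step can be illustrated with a small Python sketch. The corpus layout (a list of dictionaries with `start`, `end` and `tags` fields), the function name `select_shots`, and the example frame numbers are assumptions made for illustration; the source only states that each shot carries its boundaries and its top three predicted tags, ordered here from top 1 to top 3.

```python
def select_shots(corpus, tags, top_k=3):
    """Return (start, end) of shots whose top_k predictions contain any wanted tag.

    corpus -- list of shot records; each has 'start', 'end' and 'tags',
              where 'tags' is ordered from top 1 to top 3 prediction
    tags   -- one or more user-selected tags
    top_k  -- number of top predictions per key frame to consider (1, 2 or 3)
    """
    wanted = set(tags)
    return [
        (shot["start"], shot["end"])
        for shot in corpus
        if wanted & set(shot["tags"][:top_k])
    ]


# Hypothetical corpus with three shots (frame numbers are made up).
example_corpus = [
    {"start": 0, "end": 99, "tags": ["music", "dance", "club/bar"]},
    {"start": 100, "end": 249, "tags": ["family", "music", "technology"]},
    {"start": 250, "end": 400, "tags": ["technology", "family", "dance"]},
]

print(select_shots(example_corpus, ["music"], top_k=1))
print(select_shots(example_corpus, ["music"], top_k=3))
print(select_shots(example_corpus, ["music", "dance"], top_k=3))
```

With `top_k=1` only the first shot is selected, while `top_k=3` also picks up the second shot in which music is the second prediction, which mirrors the smaller-but-more-precise versus larger-but-less-precise behavior described above. Writing the selected frame ranges into a separate movie file is left out of the sketch.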
We find the F1-score to be a better measure of the segmentation performance, since the number of top predictions selected for the segmentation affects the segmentation precision and recall in opposite directions. While a small number of top predictions yields higher precision with relatively lower recall, the opposite is true for a higher number of top predictions. The results of this analysis are presented in Section V.

V. EXPERIMENTAL SETUP AND RESULTS
Table 5 summarizes our experimental setup. We first find the shot boundaries and determine the key frame of each shot by Algorithm 1. Figure 4a shows the detection of hard-cut shot boundaries, which are represented by 'x' marks. These marks show the points where the conditions specified in Equation 4 are satisfied. On the other hand, Figure 4b shows the detection of dissolve shot boundaries according to the conditions specified in Equation 5.
To obtain the ground truth, we watch all the movie trailers and note down the start and end time of each shot. We then compare these ground-truth shot boundaries with those determined by our shot detection algorithm. Table 6 shows the evaluation results of both the hard-cut and the dissolve shot boundary detection for movie trailers of varying lengths. The size n of the window on the similarity values is set equal to the movie's frame rate (e.g., n = 24 for a movie with a frame rate of 24). As Table 6 shows, our shot boundary detection algorithm achieves a high F1-score for both hard-cut and dissolve shot boundaries.
The average processing time for the whole algorithm, as shown in Figure 3, varies slightly with the movie type. For movies with more dynamic content and consequently more key frames (e.g., action movies), the overall processing time is slightly higher. We measure the average throughput of the whole processing pipeline (including key frame extraction, running inference on the key frames, and tags prediction) over 10 different movies. Using the experimental setup described in Table 5, the average processing speed for a 720p resolution movie is 89 frames per second.
Since we do not have a ground truth for evaluating the performance of our tags prediction algorithm, we adopt a subjective criterion. Our subjective evaluation comprises 3 different experiments, each performed with 10 different volunteers. For these experiments, we select a number of movie trailers of diverse categories. A movie trailer not only represents the whole movie well, but also helps to complete the experiments in a reasonable time.
In the first experiment, the participants watched 50 movie trailers. At the end of each movie trailer, the set of its predicted tags was revealed to the audience, and they were asked to judge its relevancy, accuracy and completeness by assigning it a score from 0 to 10 (0 being the worst and 10 being the best). The Mean Opinion Score (MOS) calculated from the participants' feedback was 84.70%.
The second experiment was performed with a different set of participants and 50 different movie trailers. In this experiment, the participants were asked to rate the predicted tags after watching each movie trailer, based on their relevancy as well as their relative strengths. This information was presented to them in the form of a visual chart as shown in Figure 5a. The MOS for this experiment was 79.20%. Figure 5b shows the predicted tags for the full length movie. It is evident that in both cases the sets of the predicted tags are similar, with different relative strengths, as there is a lot more information in the full length counterpart.
In the third experiment, performed with a different audience, the participants were handed the whole tag vocabulary and were asked to watch a different set of 50 movie trailers. After watching each movie trailer, they were asked to point out appropriate tags for the movie trailer from the tag vocabulary. This experiment enabled us to calculate the Mean Average Precision (MAP) $P$, the Mean Average Recall (MAR) $R$, and the F1-score by the following formulas,

$$P = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{t_p(i,j)}{t_p(i,j) + f_p(i,j)},$$

$$R = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{t_p(i,j)}{t_p(i,j) + f_n(i,j)},$$

$$F_1 = \frac{2PR}{P + R},$$

where $t_p(i,j)$, $f_p(i,j)$ and $f_n(i,j)$ represent the number of true positives, false positives and false negatives, respectively, for the $i$-th movie trailer and the $j$-th participant, and $M$ and $N$ represent the number of movie trailers and the number of participants, respectively. The MAP and MAR of this experiment were 76.50% and 74.55%, respectively, which gives an F1-score of 0.7551.
It is pertinent to mention that manual annotation, when used for training, does have the aforementioned limitations, because the background, experience, maturity, age, and qualification of the annotators cannot be determined when the data for a training dataset are collected at random. However, in our case, the experiments have been designed in such a way that they not only involve participants with known backgrounds, but also cover a wide variety of movies to ensure the completeness of the experiments. In addition, as opposed to the traditional methods, our training does not rely on manual annotation, which is only used for the evaluation. Hence, our subjective evaluation using three different types of experiments suffices to ascertain the efficacy of the proposed algorithm.
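The metrics of the third experiment can be sketched in Python as follows. The sketch assumes that precision and recall are computed per trailer and per participant and then averaged over all M × N pairs; this averaging scheme and the function names are assumptions, while the harmonic-mean F1-score is checked against the values reported in the text.

```python
def mean_precision_recall(counts):
    """Average per-pair precision and recall.

    counts[i][j] is a (tp, fp, fn) tuple for trailer i and participant j.
    Returns (P, R), each averaged over all trailer/participant pairs.
    """
    precisions, recalls = [], []
    for trailer in counts:
        for tp, fp, fn in trailer:
            # Pairs with an empty denominator contribute 0 (an assumption).
            precisions.append(tp / (tp + fp) if tp + fp else 0.0)
            recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Reported MAP and MAR of the third experiment.
print(round(f1_score(0.7650, 0.7455), 4))  # → 0.7551
```

The printed value reproduces the F1-score of 0.7551 stated for the third experiment.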
The evaluation of our movie segmentation technique with respect to three selected tags predicted by our algorithm is shown in Table 7. As discussed in Section IV-E, we can in principle use n = 1, 2, . . . , m predictions per key frame for movie segmentation with respect to a selected tag, where m represents the total number of tags in the vocabulary. Since we keep only the 3 topmost tags in the probability distribution of each prediction, n is limited to 3 in our case. Hence, we can segment a movie with respect to the 1, 2 or 3 topmost tags for each key frame.
In order to evaluate the efficacy of our segmentation approach, we first find the segmentation ground truth for a number of movie trailers with respect to the top 3 tags. For this purpose, we carefully watch each movie trailer and find the shot boundaries (start and end frame) for each tag. We then compare the ground truth with the individual shots found by our segmentation technique for each tag. For each movie trailer, we compare the segmentation results with the ground truth for the top 1, top 2 and top 3 predictions per key frame for each predicted tag.
Our evaluation shows an interesting relationship between the number of predictions per key frame used for segmentation and the precision-recall. Table 7 shows that as the number of predictions per key frame used for segmentation increases, the precision declines while the recall increases. The F1-score does not stay constant, however, because the recall rises more steeply than the precision falls. Consequently, the F1-scores for the top 1, top 2 and top 3 predictions per key frame are 0.80, 0.85 and 0.88, respectively. While the top 3 predictions per key frame give the highest F1-score, the trade-off between precision and recall allows the user to segment a movie either with higher precision (lower recall) or with higher recall (lower precision) by selecting just one or a higher number of predictions per key frame.
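One way to compare an extracted segment with its ground truth is to count matching frames, as in the Python sketch below. The source does not state whether the comparison is made per frame or per shot, so the frame-level matching, the function names, and the toy frame ranges are all assumptions chosen only to illustrate the precision-recall trade-off.

```python
def frames(intervals):
    """Set of frame numbers covered by inclusive (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def segment_scores(predicted, truth):
    """Frame-level precision, recall and F1 of a predicted segment."""
    pred, true = frames(predicted), frames(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the tag truly spans frames 0-199.
truth = [(0, 99), (100, 199)]
top1_segment = [(0, 99)]                          # fewer shots selected
top3_segment = [(0, 99), (100, 199), (200, 299)]  # more shots selected

print(segment_scores(top1_segment, truth))
print(segment_scores(top3_segment, truth))
```

In this toy case the smaller segment has perfect precision but covers only half of the true frames, whereas the larger segment recovers all true frames at the cost of including unrelated ones.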
To the best of our knowledge, the relevant literature reports no combined approach of movie tags prediction and subsequent segmentation against which the efficacy of our proposed technique could be compared. However, our detailed experiments suffice to demonstrate the performance of our proposed techniques.

VI. CONCLUSION
In this paper, we have proposed a movie tags prediction algorithm using deep learning. The predicted tags can be further used for segmenting a movie at the viewer's choice. Exploiting the powerful features of deep neural networks, we retrained a deep learning model (Inception-V3) using transfer learning to predict a class of a given movie frame from a carefully designed tag vocabulary. Subsequently, we proposed an efficient key frame detection algorithm which finds the representative frames of all the shots in a movie. Using the probability distribution of the prediction vectors generated by the final layer of the trained model for each key frame, we further proposed an algorithm which assigns weights to the predicted tags and finally produces a compact set of key tags which best describes the movie. The set of predicted tags can be further used to segment a movie using a corpus generated during the tags prediction algorithm.
Unlike the simple and limited approaches of movie tags prediction and segmentation studied separately or in combination in the literature, our proposed framework neither requires a priori knowledge of the tag features, nor is dependent on the user-annotated meta data which are major limitations of the techniques proposed in this context. In addition, our movie tags prediction and segmentation techniques are based on semantic analysis of the movie contents as opposed to the naive tagging and scene segmentation techniques studied in the literature.
We are also extending our tag vocabulary by identifying more classes and collecting the appropriate data. In the future, we aim to extend our algorithm to audiovisual features. We believe that the incorporation of audio features will further improve the performance of tags prediction for some classes (e.g., comedy, tragedy, etc.) which are difficult to predict accurately using only visual features. In addition, we also aim to incorporate motion information in the prediction models by using recurrent neural networks, which can better capture the dynamics of a scene by using a memory-based system.
UMAIR ALI KHAN received the master's and Ph.D. degrees from Alpen-Adria University, Klagenfurt, Austria, in 2010 and 2013, respectively. Since then, he has been working as an Associate Professor and the Head of the Department of Computer Systems Engineering, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan. He has also worked in the Fraunhofer Institute of Integrated Circuits, Erlangen, Germany, and the Machine Perception Laboratory, Hungarian Academy of Sciences, Budapest, Hungary, as a Research Scientist, from 2016 to 2017. His research interests include context-based information retrieval from images and videos using deep learning.
MIGUEL Á. MARTÍNEZ-DEL-AMOR received the M.Sc. degree in computer engineering from the University of Murcia, Spain, in 2008, and the Ph.D. degree in computer science and artificial intelligence from the University of Seville, Spain, in 2013. In his dissertation, carried out at the Research Group on Natural Computing, University of Seville, he developed parallel simulators for bio-inspired models of computation using GPUs. From 2014 to 2017, he worked as an ERCIM Fellow and later as a Research Associate with the Moving Picture Technologies Department, Fraunhofer IIS, Germany, where he was involved in the standardization of JPEGXS format, parallelization of JPEG2000 codecs with GPUs, and deep learning applications in digital cinema. Since August 2017, he has been an Assistant Professor with the University of Seville. His main research interest is on the interplay between parallel computing, bio-inspired computing, and machine learning.
SALEH M. ALTOWAIJRI received the Ph.D. degree in cloud computing from Swansea University. He has over eight years of research experience and has published several book chapters, conference, and journal papers. He is currently the Dean of the Faculty of Computing and Information Technology, Northern Border University, Rafha. His research interests include grid and cloud computing, database management systems, data mining, information systems, information technology risk management, and emerging ICT systems in healthcare and transportation sectors. He is a reviewer of several international conferences and journals.
ADNAN AHMED received the M.Eng. degree in computer systems engineering from QUEST, in February 2012, and the Ph.D. degree in computer science from UTM, in 2015. He is currently an Associate Professor with the Department of Telecommunication, QUEST, Nawabshah, Pakistan. His research interests include routing in ad-hoc networks, security and trust management in ad-hoc networks, and QoS issues in sensor and ad-hoc networks. In addition, he also works on image and video retrieval using deep learning. He is a professional member of Pakistan Engineering Council (PEC) and a regular reviewer of well reputed ISI-indexed journals.