Video Index Point Detection and Extraction Framework using Custom YoloV4 Darknet Object Detection Model

The trend of learning from videos instead of documents has increased. There could be hundreds and thousands of videos on a single topic, with varying degrees of context, content, and depth of the topic. The literature claims that learners are nowadays less interested in viewing a complete video but prefer the topics of their interest. This develops the need for indexing of video lectures. Manual annotation or topic-wise indexing is not new in the case of videos. However, manual indexing is time-consuming due to the length of a typical video lecture and intricate storylines. Automatic indexing and annotation is, therefore, a better and more efficient solution. This research aims to identify the need for automatic video indexing for better information retrieval and to ease users' navigation of topics inside a video. The automatically identified topics are referred to as "Index Points." A 137-layer YoloV4 Darknet neural network is used to create a custom object detector model. The model is trained on approximately 6000 video frames and then tested on a suite of 50 videos of around 20 hours of run time. Shot boundary detection is performed using Structural Similarity fused with a Binary Search Pattern algorithm, which outperformed the state-of-the-art SSIM technique by reducing the processing time by approximately 21% and providing around 96% accuracy. Generation of accurate index points, in terms of true positives and false negatives, is measured through precision, recall, and F1 score, which vary between 60-80% for every video. The results show that the proposed algorithm successfully generates a digital index with reasonable accuracy in topic detection.


I. INTRODUCTION
The automatic generation of multimedia content from a captured video lecture is not new. Once stored, videos create an extensive knowledge repository. They are, in themselves, a disruptive source of education apart from textbooks and classroom study [1]. One can easily see the considerable attention paid to educational videos amongst all other video sources. The findings show that up to 65 percent of Berkeley students utilize videos to improve their understanding of subjects they missed in class [2]. The Cambridge International Global Education Census Survey 2019 states that approximately 41% of school-goers globally have taken an online course that was not part of their curriculum in the past 12 months [3]. The detailed literature survey in Section II clearly states that there is a paradigm shift in the mode of online learning. Learners are increasingly turning to online video lectures; however, they are less interested in watching the entire video and prefer to focus on specific topics of interest. Therefore, indexing of video lectures is the need of the hour. Mathematically, video indexing and annotation is the probability of discovering a topic and its associated text descriptor (keywords) in a video. However, in long video lectures, manually performing this task can be time-consuming. Additionally, new multimedia content demands comprehensive indexing and restructuring of information for a deeper understanding of a video. Therefore, automatic indexing and annotation is a better and more effective solution. The literature also claims that deep learning is quite restrictive for such cases, even though much work is done in video analysis and segmentation. A video can be analyzed in two different ways. On the one hand, a video can be examined based on underlying multimedia elements, i.e., speech and images. On the other hand, text inspection and keyword identification open a whole new dimension.
This research intends to develop a technique for identifying automatic index points from a video lecture. These index points are nothing more than excerpts from a video's main headings and discussion topics. The current research involves studying all materials, including those of variable replay duration, classroom lectures, online tutorials, open courseware, and other comparable resources. A 137-layer YoloV4 Darknet Neural Network is used to create a custom object detector model. The model is trained on approximately 6000 video frames and then tested on a suite of 50 educational videos of around 20 hours of run time. The Structural Similarity Index Measure algorithm fused with a Binary Search Pattern algorithm is used for improving the overall performance of shot boundary detection.

A. RESEARCH CONTRIBUTIONS
The motivation of this research comes from the fact that accessing the range of interest in a lecture video is not easy because there is no table of contents or index section. The temporal pace of the sequences only affords the user linear exploration of the video. Through this research technique, a video lecture is automatically partitioned into segments based on image and text analysis. This process is called "Index Point Detection", and the identified keywords are termed "Index Points". Each 'Index Point' will also contain the timestamp of the relative video time instance. The textual content of the videos will be obtained by using Google's Tesseract OCR Engine. The research is limited to educational videos that contain presentation slides in the English language only. Using these indices, users can easily search topics to find a particular segment of a video file or create a digital database. Learning blended with Natural Language Processing is undoubtedly one of the most gripping scientific frontiers that persuade researchers to explore novel dimensions in indexing and partitioning, thereby producing a coherent video document structure. Thus, the main contributions of this research are:
• To study the existing methods of automatic video indexing and annotation to analyze the outcomes and gaps.
• To develop a methodology through which a video lecture is automatically partitioned into segments based on image and text analysis.
• To identify and implement an accurate shot boundary detection technique for abrupt transitions.
• To automatically detect the index points within a video by validating the trained YoloV4 Custom Object Detector.
• To apply accuracy measures and provide a comparative benchmark for further research in this area and produce a significant impact.
• To provide a method for intelligent information retrieval; researchers can find the applicability of this technique in videos of other domains.
• To support and enhance the research community's knowledge.

B. ARTICLE ORGANIZATION
This research work focuses on developing a technique for identifying automatic index points for a video lecture. Section II takes a close look at the theories and related research about video indexing to grasp detailed literature. Section III discusses the technical perspective, research methodology and the technique to facilitate the algorithm's development based on research gaps identified in the literature and the objectives of the investigation. Since we have used a subjective criterion to evaluate the success of our heading detection algorithm, the hypothesis and ground truth are discussed in Section IV. After successfully implementing the algorithm, it is then tested, and the output is displayed graphically in Section V. Finally, the complete work is concluded in Section VI, and future work is suggested in Section VII.

II. INVESTIGATIONS IN AUTOMATIC VIDEO INDEXING
In a video, words have various meanings depending upon the context. As the saying goes, "sometimes, they are important, while others are just a recollection of a former viewpoint." When looking at the appearance of an index point in a video, it is essential to understand its importance [4]. The literature review tries to uncover the critical characteristics that decide a topic change in a video lecture. A categorical literature survey is done based on the techniques intended to be performed during the research, tabulating the performance, relative merits, and limitations of existing approaches.

A. INDEXING OF VIDEOS
The repository of educational videos is immense, and it is impossible for a learner to exhaustively go through it and narrow down the content of interest. Rosalind Picard proposed affective video indexing in the 1990s while defining affective computing. As per the definition given by her, the need for indexing a video can be summarised as "Although affective annotations, like content annotation, will not be universal, they will still help reduce time searching for the right scene". Automatic video indexing is indispensable to make a video interactive and autodidactic. The purpose of indexing is to divide a lecture video into parts that reflect distinct sub-topics. A successful index for a video lecture can only be created when the contents are automatically analyzed to extract relevant metadata. The first step is to identify all the timestamps in the video when the scene changes considerably. This will allow for segment-wise browsing of the video and content analysis explicitly targeted to the category. Furthermore, a subset of these locations is chosen as index points, representing the start of a sub-topic, which subsequently serves as a part of the index for the video. The specific problem this research will cater to is extracting the keywords from the video frames based on which a lecture can be segmented. Each such keyword is called an index point, which will also contain the timestamp of the relative video time instance.

1) Existing Approaches for Automatic Video Indexing
Many researchers have developed frameworks and systems for video segmentation, either for information retrieval or for content-wise indexing. Mottaleb et al. [5] suggested one of the oldest pieces of literature in this field, which uses a combination of content analysis and physical property data to pick the shots shown to users. Riedl et al. [6] propose a novel algorithm, 'TopicTiling', which is based on the standard 'TextTiling' algorithm for text segmentation, where blocks of text are compared via bag-of-words vectors.
Their work proves significant in terms of complexity, being computationally less expensive than other LDA-based segmentation methods. F. Sauli [7] researched hypervideos, which introduce hyperlinks within a video. The work focuses on the educational domain and provides clickable access to video components like pictures, web pages, and text. Biswas et al. [8] developed a rank model, MmtoC, for essential keywords (salient words). Both visual slides and voice transcripts were utilized to create the sentences. Cost functions are derived depending on the terms used in a ranking. The topic-oriented indexing of the video is generated using a programming optimization method, which accelerates processing. An analysis of the textual content is conducted, and lecture video segmentation is based on the linguistic characteristics of the text. Lin et al. [9] convert lecture videos into text and analyse the resulting text to find dissimilarities and ways to improve them. Distinguishing context-dependent information from content-based information is one of the main challenges for video indexing researchers. Uke and Thool [10] aim to provide a digital index for education videos by converting videos into images and extracting the text directly through the Google Tesseract OCR Engine. Merler et al. [11] go one step ahead: the concept of both pieces of research is the same, except that the latter extracts and recognizes the text directly from the video rather than converting the whole video into images. Adcock et al. [12] worked on lecture webcasts and tried to provide a searchable text index so that users could access local material within a video. The technique focuses on videos with presentation slides only and performs keyframe extraction using the associativity of the textual content. Hui et al. [13] proposed a new way to improve the interpretability and manipulability of a video through a combined keyframe extraction and object-based video segmentation method. The similarity and redundancy between frames are removed through the Kullback-Leibler distance (AIKLD) criterion. There could be hundreds and thousands of videos on a single topic, all of them with different points of view and numerous instances of more profound insight. This sub-section shows that manual annotation or topic-wise indexing is not new in videos; it is often time-consuming due to the length of a typical video lecture. Automatic indexing and annotation is, therefore, a better and more efficient solution.

2) Automatic Video Indexing using Deep Learning
Deep Learning for automatic video indexing opens a whole new dimension. Table 1 summarizes the review of research articles and their proposed ideas for deep learning techniques which are already used for video indexing. A method of automated indexing of videos using a text-based query is suggested by [20]. The users are asked to provide the keywords as a query, and the framework returns a list of videos. Jothilakshmi [21] developed a novel method for content identification of news scenes in the NEWS Dataset and suggested it for automated content recognition systems. When making a transcription, this method comes with limitations: TV news sequences can only be processed using a commercially available programme, classifying them into one of six predetermined categories (National Politics, National News, World, Finance, Society & Culture and Sports). Zhou et al. [22] propose a technique that includes automated video text panoramas to capture natural scene text information and picture stitching. It has been observed that [23] obtained a very substantial 95.3% Recall and 98.6% Precision using CNN with Cosine Similarity. Lu et al. [24] indicate that deep convolutional neural networks combined with OCR tools are also used nowadays to detect and recognize video text. Through the extensive literature survey, a conclusion can be drawn that there is a need for automatic video indexing in educational videos. To reduce the computational complexity, shot boundary detection and keyframe identification techniques are also indispensable. Even though much work is done in video analysis and segmentation, Deep Learning for such cases is still restrictive.

B. A REVIEW OF VIDEO STRUCTURE ANALYSIS AND SHOT BOUNDARY DETECTION TECHNIQUES
During the past decade, shot boundary detection techniques have been a sphere of influence for researchers worldwide working in video analysis and image processing. A large portion of the research community has been devoted to shot boundary detection using edges, color, motion cues, or object correlation, singly or in combination. Numerous techniques have been developed, and several comprehensive surveys have been presented to summarise them. Table 2 compares the existing research approaches that are efficient in detecting either Gradual Transitions, Cut Transitions, or both. Although almost all the approaches detect Cut transitions efficiently, some recent approaches focus on detecting Gradual transitions. Several methods have been implemented to detect shot boundaries, producing highly accurate and acceptable results. We can infer from the literature that color histograms are the most used global features for video shot boundary detection. Histograms provide a good trade-off between accuracy and computational time. Color Histograms (CBH) use RGB space for the computation, while Hue Saturation Value (HSV) Histograms are computed in the HSV color space [43]. The HSV color space is more intuitive and an alternative to the RGB color space. It also uses three dimensions to describe a color, which produces sturdy detection results [44]. Tuna et al. [45] discuss a detailed literature survey of shot boundary detection techniques that detect Abrupt Transitions and Gradual Transitions. Ma et al. [46] detected gradual and abrupt transitions using a dual detection model by performing pre-detection.
An uneven blocked mechanism is used based on the human visual system, histogram, and pixel difference for pre-detection. The false detection was removed in the detection phase by employing the Scale Invariant Feature Transform (SIFT).

TABLE 2: Cut and Gradual Transition Detection (columns: Year, Reference, CT Detection, GT Detection)

Amiri and Fathy [47] developed a noise-robust algorithm using Generalized Eigenvalue Decomposition (GED) and obtained a distance function. Cut transitions were realized when abrupt changes were noticed, and gradual transitions were obtained when semi-Gaussian behaviour was reflected in the distance function. Multilevel Difference of Color Histograms with a voting mechanism [38] and Convolution Neural Networks (CNN) [48] were used to determine cut and gradual transitions. Hue Saturation Value is used for detecting abrupt shots, and a 3-dimensional convolution layer in a CNN is used to obtain gradual transitions. Furthermore, CNN avoids the disturbance caused by abrupt transitions [30]. In [35], the Frobenius norm with double threshold and Singular Value Detection updating were used for detecting Cut and Gradual Transitions, respectively. A Modified Artificial Bee Colony algorithm is used to identify candidate boundaries. A hybrid of Fast Accelerated Segment Test and fuzzy histograms is further used to extract local and global features to verify the obtained boundaries [49]. Symmetric Local Binary Pattern with histogram features was implemented on six TV series to handle illumination changes and detect hard cuts [2]. Temporal features were extracted from movies and trailers, followed by Dynamic Mode Decomposition to extract temporal background and foreground modes to detect hard cuts, fades, and dissolves [5]. The High-Level Fuzzy Petri Net model with keypoint matching is used on commercial videos, movies, TV shows, and dramas to detect gradual transitions by removing false shots [32]. Audio and optical features were extracted from talk shows obtained from the Daily Motion and YouTube websites using clustering-based algorithms, showing that the number of frames is limited to two when the actor movement is greater [50].
Candidate segments are obtained by comparing Oriented FAST and Rotated BRIEF (ORB) features, Cut Transitions are extracted by comparing Structural Similarity, and Gradual Transitions are detected by a gradual transition model on 106,049 test frames from the Open-Video Project, YouTube and YOUKU [51].

III. TECHNICAL PERSPECTIVE OF THE RESEARCH
Semantic characteristics can now be extracted from pictures and video sequences. There are still problems with video lectures because of their lengthy running duration, non-uniform material, and complex narratives. Automatic video indexing is a technique that presents a potential solution to these problems. This research is carried out in 3 phases, each of which has its significance and contribution towards better accuracy and less computational complexity, as depicted in Figure 1. The results are evaluated in terms of Precision, Recall, and F1 Score using the Mean Opinion Score. When a video is split into frames, numerous boundary and non-boundary frames are generated. The number of non-boundary frames is comparatively higher than that of boundary frames. Since the non-boundary frames are redundant, we need to eliminate them to reduce the computational complexity. For this purpose, Shot Boundary Detection is used. When shot boundaries are successfully identified, a single keyframe represents each shot. We call that keyframe a 'Candidate Segment'. We can conclude that in a non-boundary segment, the first frame and the last frame are highly correlated [52]. The significant phases in the proposed work are gathering a custom data set, shot boundary detection and candidate segment identification, customizing the configuration files necessary to train the neural network model, validating the training data for accuracy, and finally testing the custom object detector to see the model's accuracy in real-time.

A. FRAMES EXTRACTION AND PROCESSING
This study is based on a dataset of video lectures, and the model is trained on videos ranging in length from 30-45 minutes, for a total run time of 22 hours. Lecture videos often feature a frame rate of 30, which means that 30 frames are produced every second [53]. As a result, the average number of frames per video we have worked on is 35,000-40,000 frames, with just 25-30 boundary frames. Frame reduction is therefore a significant challenge: the fewer the frames, the faster and more effective the video processing is. Hence, to reduce the number of frames initially, we assumed that since a topic will be discussed for at least 10 seconds in a video lecture, a delay of the same duration can be added in frame generation, and subsequent redundant frames can be reduced. For example, the total number of frames initially produced for the video in our dataset entitled "What is JWT JSON" was 38,680. The frames are reduced to 19,015 after a 10-second delay, which is almost half the number. Another critical aspect of frame generation is keeping track of the frame's arrival time w.r.t. video playback. This frame timestamp indicates the exact point in time when that frame appears in the video. Since the purpose of the research is to provide automatic video indexing, timestamps play a vital role in video browsing and navigation.
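The sampling step described above amounts to keeping one frame per interval and recording its playback timestamp. A minimal sketch of the index arithmetic is given below; the function name is illustrative, and in a real pipeline the frames themselves would be read with OpenCV's `cv2.VideoCapture`:

```python
def sample_indices(total_frames, fps=30.0, interval_s=10):
    # Keep one frame every `interval_s` seconds of playback and pair its
    # index with the timestamp (in seconds) at which it appears in the video.
    step = int(fps * interval_s)  # e.g. 30 fps x 10 s = every 300th frame
    return [(i, i / fps) for i in range(0, total_frames, step)]
```

The returned timestamps are exactly the "arrival time w.r.t. video playback" needed later for browsing and navigation.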

B. BINARY SEARCH PATTERN ALGORITHM (BSPA)
The Binary Search Pattern Algorithm (BSPA), also known as the half-interval algorithm, is a searching algorithm that searches for a targeted image frame in a sorted array of frames. In our case, the frames are sorted based on their index number. BSPA is required to reduce the computational time from O(n) to logarithmic time, O(log n), where n is the number of elements in the array. Algorithm 1 represents how BSPA is used to divide the array of n frames and select the pivot frame.

Algorithm 1 Binary Search Pattern Algorithm
    A ← Array_of_frames
    n ← Size_of_array
    x ← Boundary_frame
    Compare ssim_left and ssim_right
    if ssim_left = ssim_right then set ssim_mid = (ssim_left + ssim_right) / 2
    else ssim_pivot = ssim_mid
    Compare ssim_pivot and ssim_right
    if ssim_pivot = ssim_right then boundary_frame = ssim_right
End Procedure
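The halving step can be pictured as a predicate-based binary search: given a test that is monotone over the frame indices (true up to the boundary, false after it), the boundary frame is found in O(log n) comparisons. This is a sketch with illustrative names, not the authors' exact code:

```python
def binary_search_boundary(n, in_first_shot):
    # Find the first index for which in_first_shot is False, i.e. the
    # boundary frame, using O(log n) predicate evaluations.
    # Precondition: in_first_shot(0) is True and in_first_shot(n - 1) is False.
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if in_first_shot(mid):
            lo = mid  # boundary lies in the right half
        else:
            hi = mid  # boundary lies in the left half
    return hi
```

A linear scan would need up to n comparisons; the halving pattern needs roughly log2(n), which is the source of the speed-up claimed for BSPA.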

C. STRUCTURAL SIMILARITY INDEX
Similarity is a computed value between two images that determines how similar the images are, pixel-wise or visually. There are various methods to compute the similarity between two images, such as Template Matching, Image Descriptors (e.g. SIFT, SURF and FAST) and the Structural Similarity Index Measure (SSIM).
We have worked on the SSIM approach in this experiment because of its easy implementation with satisfactory results. SSIM is a metric that looks for the similarity between the pixels of two images. If the pixel density is remarkably similar, SSIM will return 1, and if it is vastly different, SSIM will return -1. The assessment index of quality is based on the computation of three expressions, the Luminance, Contrast and Structural expressions, as per equation 1:

SSIM(a, b) = [l(a, b)]^α · [c(a, b)]^β · [s(a, b)]^γ    (1)

Here, l, c and s stand for the Luminance, Contrast and Structural expressions respectively. The expressions for Luminance, Contrast and Structure can be observed in equations 2, 3 and 4:

l(a, b) = (2µ_a µ_b + z_1) / (µ_a² + µ_b² + z_1)    (2)

c(a, b) = (2σ_a σ_b + z_2) / (σ_a² + σ_b² + z_2)    (3)

s(a, b) = (σ_ab + z_3) / (σ_a σ_b + z_3)    (4)

In the equations above, µ_a and µ_b are the local means, σ_a and σ_b the standard deviations, and σ_ab the cross-covariance of the two images, while z_1, z_2 and z_3 are SSIM stabilizing constants. Under the assumption that α = β = γ = 1 and z_3 = z_2 / 2, the above equations simplify to equation 5:

SSIM(a, b) = (2µ_a µ_b + z_1)(2σ_ab + z_2) / ((µ_a² + µ_b² + z_1)(σ_a² + σ_b² + z_2))    (5)

Technically, SSIM is based on three utility functions that perform all essential tasks.
• gaussian: a function that generates an array of numbers sampled from a Gaussian distribution. The length of the array is equal to the size of the window, and sigma (σ) represents the standard deviation of the distribution.
• create_window: a function used to create a 2-dimensional array by multiplying the array generated by the gaussian function with its transpose.
• ssim: a function used to perform the mathematical calculations, including luminance, contrast and the SSIM score, based on the standard formulae.
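A minimal NumPy sketch of these three utilities follows. One simplification is labelled explicitly: this `ssim` computes the global statistics of equation 5 directly, whereas the implementation described here applies the index locally through the Gaussian window and averages the result:

```python
import numpy as np

def gaussian(window_size, sigma):
    # 1-D Gaussian kernel sampled at integer offsets, normalised to sum to 1
    xs = np.arange(window_size) - window_size // 2
    g = np.exp(-(xs.astype(float) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def create_window(window_size, sigma=1.5):
    # 2-D window: outer product of the 1-D kernel with its transpose
    g = gaussian(window_size, sigma)
    return np.outer(g, g)

def ssim(img_a, img_b, z1=(0.01 * 255) ** 2, z2=(0.03 * 255) ** 2):
    # Whole-image SSIM (equation 5); a simplified stand-in for the
    # windowed, locally averaged computation described in the text.
    a, b = img_a.astype(float), img_b.astype(float)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + z1) * (2 * cov_ab + z2)
    den = (mu_a ** 2 + mu_b ** 2 + z1) * (var_a + var_b + z2)
    return num / den
```

For identical images the numerator and denominator coincide, so the score is exactly 1, matching the behaviour described above.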
A subtlety in implementing SSIM is that, for image quality assessment, it is useful to apply the SSIM index locally rather than globally. Therefore, we have used the Mean Index Value for Structural Similarity. Algorithm 2 represents the state-of-the-art SSIM approach implementation. It can be seen in Algorithm 2 that the similarity threshold between two image frames is kept at 0.42 (i.e. 42% similarity for non-adjacent frames), which is the mean similarity score for the SSIM approach.
Algorithm 2 SSIM Threshold
Conditions: left, mid, right
Function F:
    if (right − left) > 5 then
        return Algo(left, mid)
    else if right > left + 1 then
        if SSIM score < 0.42 then
            return Algo(left, mid)
        return Algo(mid, right)
End Function

D. CANDIDATE SEGMENT IDENTIFICATION USING STRUCTURAL SIMILARITY AND BINARY SEARCH PATTERN
Identifying correct and accurate shot boundaries is the backbone of successful video segmentation and content-based video retrieval. A major part of this research revolves around optimizing the technique to identify abrupt transitions for shot boundary detection based on Structural Similarity (SSIM) fused with the Binary Search Pattern Algorithm (BSPA). The algorithm for the same is represented in Algorithm 3. We tried to make the overall solution less costly and computationally efficient. The proposed approach and the state-of-the-art SSIM technique are examined for shot boundary detection on Python 3.7.1 with a 1.80 GHz CPU and 8 GB RAM, on 5900 image frames, to gauge their effectiveness. The results are quantified on the basis of precision, recall, and F1 score. The comparative results show that the proposed algorithm successfully detects abrupt transitions with reasonable accuracy and significantly decreases the complexity of the algorithm. The input to the system is a video sequence that is converted into image frames. An index number, 0th to (N-1)th, is assigned to each frame, where N is the total number of frames. A 5-frame window is created to check the structural similarity of two frames. The algorithm checks whether the frames in the window are fewer than five. If so, it is evident that the last frame is the boundary frame, and hence the algorithm returns the last frame; otherwise, the window is moved forward to the next set of 5 frames. The locations of the frames to be compared are divided into left, mid and right frames. The 0th and (N-1)th frames are compared by the standard SSIM approach in the first iteration. If they are found similar, we can instantly conclude that the video contains only a single shot. However, this case is only possible for videos with a few seconds of running time, and those videos are not targeted in the proposed research.
Otherwise, when the 0th and (N-1)th frames are dissimilar, the algorithm will divide the frames in half. The execution of the 5-frame window is as follows:
• Check the SSIM score of the first and last frame inside the 5-frame window.
• If the SSIM score < 0.42, the frames do not belong to the same shot, and therefore Algo(left, mid) is executed.
• Otherwise, since the SSIM score ≥ 0.42, the frames are adjacent, i.e. they belong to the same shot.
• Move the 5-frame window to the next five frames by executing Algo(mid, right).
• Return the left frame as the boundary frame.
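The fused approach can be sketched as follows. This is a simplification with illustrative names: frames are represented by single intensity values and `ssim_score` stands in for the full SSIM computation; in the real pipeline each element would be an image frame:

```python
SSIM_THRESHOLD = 0.42  # mean similarity score for non-adjacent frames

def ssim_score(frame_a, frame_b):
    # Stand-in similarity: frames are single intensity values here;
    # the real pipeline computes SSIM between two image frames.
    return 1.0 - abs(frame_a - frame_b) / 255.0

def find_boundary(frames, left, right):
    # Binary search: frames[left] and frames[right] are known to be
    # dissimilar, so a shot boundary lies somewhere between them.
    if right - left <= 1:
        return right  # first frame of the new shot
    mid = (left + right) // 2
    if ssim_score(frames[left], frames[mid]) < SSIM_THRESHOLD:
        return find_boundary(frames, left, mid)   # boundary in left half
    return find_boundary(frames, mid, right)      # boundary in right half

def detect_boundaries(frames, window=5):
    # Slide a 5-frame window; only when its end frames disagree does the
    # logarithmic search run, instead of comparing every adjacent pair.
    boundaries, start = [], 0
    while start < len(frames) - 1:
        end = min(start + window - 1, len(frames) - 1)
        if ssim_score(frames[start], frames[end]) < SSIM_THRESHOLD:
            boundaries.append(find_boundary(frames, start, end))
        start = end
    return boundaries
```

The window test skips most of the redundant non-boundary frames, and the binary search inside a flagged window pinpoints the boundary in O(log n) comparisons rather than O(n).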

VOLUME , 2021
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Table 3 shows the comparison between the state-of-the-art SSIM and the proposed approach for abrupt shot boundary detection. The results are compared based on the average percentage of precision, recall and F1-Score, as depicted in Figure 2, Figure 3 and Figure 4 respectively. The comparative analysis of existing state-of-the-art techniques against the proposed method is displayed in Table 4. The only limitation of our proposed approach is that if a shot is repeated, the proposed method considers it a new shot and fetches its boundary frame, so the boundary frames might be similar, hence increasing the redundancy. The computational time for detecting shot boundaries through the proposed approach improves significantly compared to the traditional SSIM technique. The average time taken by the proposed approach is 129 seconds, which is 21.4% less than the basic approach, i.e. approximately 20-25% faster. The average precision is 83%, which is the same for both approaches.

E. INDEX POINT DETECTION USING YOLOV4
After successfully identifying Candidate Segments, the next important task is to extract the index points from those image frames. A YoloV4 Darknet custom object detector is used for automatic extraction of index points. Darknet is a robust open-source neural network framework geared towards object detection, using CPU and GPU computation. Its unique architecture and high speed distinguish the Darknet framework from other existing neural network architectures. YOLO is one such specialized framework found in Darknet. YOLO (an acronym for "You Only Look Once") can run on a CPU, but it gains up to 500 times more speed on GPUs because it uses CUDA and cuDNN. Another positive aspect of Darknet is that it is comparatively convenient to train on customized data sets compared to other heavily optimized and popular frameworks. Through this research, we concluded that the method of automatic keyword extraction using a Deep Neural Network gives satisfactory results in terms of speed and accuracy.

1) Configuration
This technique uses a 137-layer YoloV4 custom object detector neural network model for identifying the headings/keywords in the image frames. The neural network model is trained on approximately 6000 video frames and then tested on a suite of 50 educational videos of around 20 hours of run time. The comparative results, viz. detection accuracy, precision, recall, and F1 score, show that the proposed algorithm successfully generates a digital index with reasonable accuracy in topic detection.

TABLE 4: Comparative Analysis of Shot Boundary Detection Techniques (Method | Merits | Limitations)
[54] | 120x real-time | Misses long dissolves, partial scene changes and scenes with motion blur
ORB fused with Structural Similarity [55] | Device-independent, low computational burden and high accuracy | Some false positives caused by flicker and blur
Spatio-temporal Regularity of Video Cube [56] | High accuracy, no false or missed detections | High-speed GPU and high computational cost
Hybrid Keypoint Detection [32] | High accuracy | Fails to identify the essential key points
Proposed Method | High recall, low computational time and burden | Some boundary frames might be redundant

The technical configuration of YoloV4 is mostly summarized by 4 Python files, as depicted in Figure 1, viz.:

1) Detection.py
Detection.py uses SSIM and OpenCV as imports, gets the unique frames from SSIM.py and the title frames from the YOLOv4 model, which is trained specifically for this purpose. After detection, it yields the data to UI.py in the following format: {'frame_no': frame_no, 'x': x, 'y': y, 'x + w': x + w, 'y + h': y + h}

2) Reducer.py
This Python script gets the video, applies SSIM to the video frames, and yields the frames to Detection.py when a unique frame is found.
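The hand-off between Reducer.py and Detection.py can be pictured as a pair of generators. This is a sketch with illustrative names, not the authors' exact code: `is_unique` stands in for the SSIM uniqueness check and `detect_box` for the trained YOLOv4 model:

```python
def reduce_frames(frames, is_unique):
    # Reducer.py step: yield only the frames that SSIM judges unique,
    # together with their original frame numbers.
    for frame_no, frame in enumerate(frames):
        if is_unique(frame):
            yield frame_no, frame

def detect_titles(unique_frames, detect_box):
    # Detection.py step: run the detector on each unique frame and yield
    # the heading's bounding box in the documented dictionary format.
    for frame_no, frame in unique_frames:
        x, y, w, h = detect_box(frame)
        yield {'frame_no': frame_no, 'x': x, 'y': y,
               'x + w': x + w, 'y + h': y + h}
```

Because both steps are generators, frames flow through one at a time and the full frame set never has to be held in memory.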

3) UI Runner.py
This file starts the UI and is the program's entry point (imports: UI, Detection).

4) UI.py
This file initializes the UI and sets up the VLC Media Player. It takes the extracted title frames from detection, obtains the text image using OpenCV, applies SSIM similarly to reduce redundant frames, converts the remaining frames to text using PyTesseract, and timestamps the unique remaining frames (imports: Detection, PySide2, python-vlc, SSIM).
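The OCR and timestamping step in UI.py can be sketched as follows. The frame-to-timestamp conversion is standard arithmetic; the OCR helper uses `pytesseract.image_to_string` on a grayscale crop, which is one plausible reading of the pipeline rather than the authors' exact code.

```python
def frame_to_timestamp(frame_no, fps):
    """Convert a frame index to an hh:mm:ss index-point timestamp."""
    total_seconds = int(frame_no // fps)
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

def ocr_heading(frame_bgr):
    """Extract heading text from a detected title frame (illustrative sketch)."""
    import cv2        # imported lazily; only the timestamp helper is dependency-free
    import pytesseract

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray).strip()
```

Pairing `ocr_heading` output with `frame_to_timestamp` for each unique frame yields the (topic, timestamp) pairs that form the digital index.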

2) Heading Extraction
This research applies a single-class YOLOv4 custom object detector trained through Google Colaboratory. Approximately 6000 video frames from different videos were included in our dataset; 90% of the frames were used for training our model and 10% for validation. YOLO unzips and inflates all of our training and validation images. Testing of the model is done explicitly on a suite of 50 different video lectures. A sample result of heading detection can be seen in figure 5.
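The 90/10 train/validation split described above can be sketched as below. The seeded shuffle is an assumption for reproducibility; the paper does not state how frames were assigned to each split.

```python
import random

def split_dataset(frame_paths, train_frac=0.9, seed=42):
    """Shuffle frame paths reproducibly and split them into train/validation lists."""
    paths = list(frame_paths)
    random.Random(seed).shuffle(paths)  # seeded so the split is repeatable
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]
```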

IV. DATASET, HYPOTHESIS AND GROUND TRUTH
Training and validation data comprise extracted frames from video lectures of various domains. The frames are then manually labelled through the makesense.ai object detection toolkit. Since this is a single-class object detector, the only label is heading, as shown in figure 6, and each training image has an associated .txt file containing the coordinates of the heading in the image.
The number of index points within a video varies. The difference in precision, recall, and F1 score for heading detection between the proposed approach and the Mean Opinion Score given by the external volunteers can be seen in figure 7. Several spurious frames are produced during detection, some false negatives are generated, and some unnecessary headings are generated due to PyTesseract's limitations. It can be observed from the figure that the Precision, Recall, and F1 Score of the proposed approach lie between 65-70%.

V. RESULTS AND DISCUSSIONS
This paper analyzes the results in two categories. The results show that the mean precision and mean F1 score are 70%, as depicted in Figure 9, whereas recall is higher at 76%, as shown in Figure 10. This shows that the number of true positives compares favourably with the number of false negatives.
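The reported scores follow the standard definitions of precision, recall, and F1 over true-positive, false-positive, and false-negative counts; a minimal helper:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts.

    precision = TP / (TP + FP); recall = TP / (TP + FN);
    F1 is the harmonic mean of the two.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A recall above precision, as reported here, means fewer false negatives than false positives: the detector misses few true headings but emits some spurious ones.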
On the trained YoloV4 model, 81.24% keyframe detection efficiency was obtained on the first set of test images, and the Precision, Recall, and F1 score vary between 60-75%, which is considered reasonably good. Accuracy, Precision, Recall, and F1 Score are used to reflect the overall performance and outcomes. The threshold values are based on experimental observations of multiple videos. As this is a new approach, much work can be done in the future; for instance, an adaptive threshold value could be used instead, and the performance is still poor in the case of handwritten text in the videos. The framework used a basic binary search algorithm fused with the Structural Similarity Index. The dataset used for training the neural network is made from 6000 video frames, and for testing, a suite of 50 videos was randomly selected.

It is important to note that the manual annotation used for training has limitations. For example, during the random data collection to create a training dataset, one cannot always verify the annotators' context, accuracy, and qualification. However, to confine the scope of the study to a specific domain, the experiments were structured to include only participants with an educational context and to cover videos from educational portals. Furthermore, manual indexing is used only for evaluation and not for the generation of index points. As a result, our subjective assessment of the proposed algorithm using three different types of experiments is sufficient to determine its efficacy.
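The binary-search pruning fused with a similarity measure can be illustrated with a toy sketch. Here `similar` stands in for an SSIM-threshold comparison between two frames; this is an illustrative reconstruction of the idea, not the authors' exact implementation.

```python
def find_boundaries(frames, similar, lo=0, hi=None):
    """Return indices where a new shot begins within frames[lo..hi].

    Pruning heuristic: if the endpoints of a span are similar, the span is
    assumed to contain no boundary and is skipped wholesale, which is what
    cuts comparisons relative to frame-by-frame SSIM.
    """
    if hi is None:
        hi = len(frames) - 1
    if hi <= lo:
        return []
    if similar(frames[lo], frames[hi]):
        return []  # endpoints match: assume no shot change inside
    if hi - lo == 1:
        return [hi]  # boundary lies exactly between two adjacent, dissimilar frames
    mid = (lo + hi) // 2
    return (find_boundaries(frames, similar, lo, mid)
            + find_boundaries(frames, similar, mid, hi))
```

The heuristic trades a little accuracy for speed: a span whose interior changes but whose endpoints happen to look alike is pruned, which is consistent with the paper's observation that some boundary frames may be missed or redundant while overall processing time drops.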

VII. FUTURE ASPECTS OF THE RESEARCH
The present study tackled some aspects; nevertheless, more in-depth optimization and benchmarking are necessary to study the effect on the efficiency of index point detection for standardization and relevant summaries. In the future, we shall explore the applicability of this technique and any potentially needed extensions. This research discovered that the frequency of words, n-grams, and the number of first-time words appearing in a video provide valuable information for topic-wise video segmentation. A long-term, real-world impact analysis is therefore expected in order to evaluate keywords in instructional videos. The research findings can also be improved by synchronizing speech with textual data: the spoken words can be extracted using speech transcripts and, based on the context and semantics of the speech, associated with image text.

DR. SUDHA MORWAL is currently working as an Associate Professor in the Department of Computer Science, Banasthali Vidyapith. She holds MSc, MTech, and PhD degrees in computer science and has played a prominent role in the department's academic activities for the last decade. She is a co-author of many books on computer science in Hindi and English and has published many papers in national and international journals and conferences.