Fast Shot Boundary Detection Based on Separable Moments and Support Vector Machine

The large number of visual applications in multimedia sharing websites and social networks contributes to the increasing amount of multimedia data in cyberspace. Video data is a rich source of information and is considered the most demanding in terms of storage space. With the huge development of digital video production, video management becomes a challenging task. Video content analysis (VCA) aims to provide big data solutions by automating video management. To this end, shot boundary detection (SBD) is considered an essential step in VCA. It aims to partition the video sequence into shots by detecting shot transitions. The high computational cost of transition detection is considered a bottleneck for real-time applications. Thus, this paper addresses the balance between detection accuracy and speed in SBD by presenting a new method for fast video processing. The proposed SBD framework is based on the concept of candidate segment selection with frame active area and separable moments. First, for each frame, the active area is selected such that only the informative content is considered. This leads to a reduction in the computational cost and disturbance factors. Second, for each active area, the moments are computed using orthogonal polynomials. Then, an adaptive threshold and inequality criteria are used to eliminate most of the non-transition frames and preserve candidate segments. For further elimination, two rounds of bisection comparisons are applied. As a result, the computational cost is reduced in the subsequent stages. Finally, a support vector machine classifier is used to detect the cut transitions. The improvement of the proposed fast video processing method over existing methods in terms of computational complexity and accuracy is verified. The average improvements in terms of frame percentage and transition accuracy percentage are 1.63% and 2.05%, respectively. Moreover, a comparative study of the proposed SBD algorithm against state-of-the-art algorithms is performed. The comparison results confirm the superiority of the proposed algorithm in computation time, with an improvement of over 38%.


I. INTRODUCTION
With the growth of video multimedia data, internet traffic is shifting from a moderately consistent stream to a dynamic traffic pattern [1], [2]. Searching an entire video for specific content is time-consuming. Moreover, video production and creation remain challenging [3]. Therefore, in the last decade, efficient management systems for indexing, browsing, and retrieval of videos have been considered. Content-based video indexing and retrieval aim to automate video structure analysis. Consequently, shot boundary detection (SBD) is a pre-processing step in video analysis [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir.
Video is a three-dimensional (3D) signal that involves a combination of text, audio, and images with a time dimension. A video sequence generally comprises frames, shots, scenes, and stories. Video shots, or simply shots, are the basic units in a video and are considered a useful tool in providing semantic search [3]. Each shot consists of a sequence of frames, and each video frame represents a single image [5]. In a video sequence, shot boundaries are divided into cut transitions (CTRs) and gradual transitions (GTRs). A CTR is the sudden transition from one video shot to another, whereas a GTR is the gradual transition between two shots and involves multiple frames. GTRs are categorized into fade, dissolve, and wipe transitions [6]. Transition detection in SBD algorithms can be categorized based on the classification technique used: 1) machine learning techniques, 2) rule-based techniques, and 3) combinations of machine learning and rule-based techniques [7]. SBD algorithms based on machine learning can be divided into supervised and unsupervised.
Generally, the essential process in SBD algorithms is feature extraction, whose main goal is a meaningful representation of the visual information [8]. Feature extraction can be classified based on the domain in which the frame is processed, namely the compressed and uncompressed domains [9]. Because of the valuable information available in the uncompressed domain, SBD algorithms are primarily centered on it [7]. Different algorithms are used in the uncompressed domain, such as pixel-based algorithms [10], transform-based algorithms [11], histogram-based algorithms [12], and edge-based algorithms [13].
Transform-based algorithms have been employed by many researchers, where the transform coefficients are extracted and considered as features. The discrete Walsh-Hadamard transform, discrete wavelet transform, and discrete Fourier transform are examples of transforms used in SBD algorithms that achieved acceptable performance in detecting transitions [14]. However, these algorithms are computationally expensive [15] because they process all the frames in a video sequence.
Processing all video frames will result in high detection accuracy. However, considerable computation time is consumed in processing non-boundary frames. To reduce the computation cost by eliminating the non-boundary frames, an additional level in video processing operation is applied. This level is considered a pre-processing stage in reducing computation in subsequent stages.
Our contributions can be summarized as follows:
• An efficient candidate segment selection algorithm is developed to eliminate non-transition frames and preserve candidate segments. The algorithm applies a bisection technique. In addition, the criteria used for selecting the candidate segments are developed and show improved performance compared to existing algorithms, which reduces the time required to process the entire video for shot boundary detection.
• A new active area selection technique is presented. The technique utilizes 2/3 and 7/8 of the frame's height and width, respectively. The active area ratio was selected after performing several experiments, which reveal that the selected active area gives better performance in terms of processing time for shot boundary detection.
• The squared Tchebichef-Krawtchouk polynomials (STKP) are employed for the first time in an SBD algorithm. These polynomials show better performance in terms of localization property and energy compaction when compared to existing polynomials. Also, they have proved effective in other computer vision algorithms.
• Finally, the proposed SBD algorithm is designed based on the previously mentioned tools. The proposed algorithm shows a noticeable improvement in computation speed while preserving accuracy.
This paper is organized as follows. Section II describes related work. Section III presents the mathematical fundamentals of STKP and moment computations. Section IV introduces the proposed fast video processing method. Section V provides a performance evaluation of the proposed method. Section VI presents the conclusions.

II. RELATED WORK
Reducing computation cost without sacrificing accuracy is a challenging task in big multimedia data analytics [16]. In the past two decades, several algorithms have been proposed to improve detection accuracy with fast video processing. This development is considered valuable support for high-level applications of video structure analysis [17]. Most existing algorithms consist of three levels to detect transitions [6]. Feature extraction is the primary level, representing the rich content (visual information) of the video frames [18]. Different approaches have been proposed for visual information representation (feature extraction). Pixel-based [19], histogram-based [20], and transform-based [6] features are examples used in these approaches. The transform domain (moment domain) offers a powerful ability to analyze a signal's components and suppress noise [21]. The next level in SBD is constructing similarity and dissimilarity signals by extracting temporal characteristics. These characteristics specify the variations between consecutive frames or between frames at a given inter-frame distance [6]. At a shot transition, the similarity signal (SS) has low values, whereas the dissimilarity signal (DS) has high values. The opposite applies within the same shot [6]. The last level is to classify the SS/DS values into cut/gradual transitions and non-transitions [6].
Conventional SBD algorithms process all video frames. To further reduce the computational cost, Li et al. [22] proposed a pre-processing method in which only the candidate segments are selected before the main SBD levels. Every 21 frames form a segment, and every 10 segments are grouped. An adaptive threshold with bisection-based comparisons was used to eliminate the non-boundary segments. However, this method showed sensitivity to object and camera motion when the pixel-based approach is used. Moreover, it only approximated the location of the transitions [23].
In [23], an improved adaptive threshold was used. A histogram-based approach and singular value decomposition were applied to determine the actual transitions. In addition, a filtering process was implemented to eliminate false alarms. However, because of the histogram method, the detection was prone to missed detections. In [24], the same adaptive threshold as in [23] was implemented to improve the speed of the SBD algorithm. However, its accuracy needs to be improved [15].
Tippaya et al. in [15], [25] selected the candidate segments by extracting the temporal characteristics using speeded-up robust features (SURF) descriptors and histograms. Despite the high accuracy from the multi-modal features, the execution speed was very low.
Dhiman et al. [26] proposed a method similar to the one in [23]. In addition, only the blue plane is used for feature extraction to reduce the computations. However, an SBD algorithm that achieves detection accuracy with low computation cost still needs to be devised. In addition, some SBD algorithms employ deep neural networks. Xu et al. [24] presented an SBD algorithm that detects transitions based on features extracted from a convolutional neural network (CNN); however, the detection accuracy is still low. Gygli [27] and Hassanien et al. [28] proposed SBD algorithms using CNNs with a fully connected architecture. Hassanien et al. predicted the likelihood of a transition within a sequence of 16 frames using the convolutional 3D network [29]. Compared to that of Hassanien et al., Gygli's algorithm has a smaller network and no post-processing. On the other hand, Hassanien et al. utilize a support vector machine (SVM) classifier (the predictions are not used directly). The algorithm of Hassanien et al. outperformed Gygli's algorithm, and the reported F1-score was 92.5%. In addition, the algorithms based on deep neural networks run on expensive GPUs and require a very large training dataset.
In the algorithms mentioned above, the entire video frame is processed for each candidate segment. In [7], Abdulhussain et al. proposed an SBD algorithm based only on the selection of the frame active area. By eliminating the persistent and variable materials, only the valuable material with rich content is preserved by considering 7/8 of the frame height and 3/4 of the frame width. Furthermore, candidate segment selection is applied based on a threshold-based approach and inequality criteria. The SBD features are extracted based on the moment computation. The local moment is computed by applying the moment block processing proposed in [8]. Then, a group of three local moments is computed by applying embedding operators. However, further non-boundary frames can be eliminated by applying additional comparison criteria. Therefore, additional effort must be made to provide a compromise between detection accuracy and computation cost.

III. PRELIMINARIES
Orthogonal polynomials (OPs) are considered efficient tools in several applications such as information hiding [30]-[32], face recognition [21], SBD [7], speech enhancement [33], [34], and handwritten numeral recognition [35]. The moments are the projection of signals on OPs [21], [36], [37]. In this section, the mathematical definitions of STKP and the associated separable moments are presented. These moments are used to transform the video frames into the moment domain and form the features. In this paper, STKP is utilized as the OP because it has been shown to outperform other existing OPs in terms of energy compaction and localization property [38]. In addition, the extracted features affect the selection of the candidate transition, as shown in Section IV-B.

A. STKP DEFINITION
A new orthogonal polynomial is generated by multiplying two orthogonal polynomials [39]. From this perspective, STKP is generated by multiplying two hybrid forms of polynomials, namely, the Krawtchouk-Tchebichef orthogonal polynomial (KTOP) [40] and the Tchebichef-Krawtchouk orthogonal polynomial (TKOP) [39]. These hybrid forms are generated from the Krawtchouk orthogonal polynomial (KOP) [41] and the Tchebichef orthogonal polynomial (TOP) [42]. KTOP is defined in [40] as Equation (1) for n, x = 0, 1, ..., N − 1, where T and K are the TOP and KOP, respectively, and p is the control parameter of KOP. TKOP is expressed in [39] as Equation (2). The STKP form, Equation (3), is defined by combining (1) and (2) [38].

The moments are considered a beneficial tool to characterize data by preventing data redundancy. Both 1D and 2D signals can be described by the moments. For a two-dimensional (2D) signal I(x, y) with a size of N × N, the moment can be defined as [38]:

$$M_{nm} = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} Z_n(x)\, Z_m(y)\, I(x, y), \quad n, m = 0, 1, \ldots, N-1 \tag{4}$$

where Z_n(x) and Z_m(y) are the orthogonal polynomials. The reconstruction of the 2D signal can be computed as follows:

$$I(x, y) = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} M_{nm}\, Z_n(x)\, Z_m(y) \tag{5}$$

The matrix representation provides a second, faster and simpler form of the moment in Equation (4) as follows [38]:

$$M = Z_1\, I\, Z_2^{T} \tag{6}$$

where Z_1 and Z_2 represent the matrix forms of Z_n(x; p, N_1) and Z_m(y; p, N_2), respectively. The transform coefficients (moments) are considered the features that describe different types of signals [43]. To select the low-order moments with high-energy coefficients, a specific order (ord) is specified. The coefficients are selected row-wise from each generated polynomial matrix (Z_1 and Z_2) such that the number of selected coefficients equals the required order [38].
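The matrix form of the moment computation can be illustrated with a minimal sketch. Because the STKP generating equations are not reproduced here, the sketch uses an orthonormal DCT-II basis as a stand-in for the STKP matrices; the function names, the stand-in basis, and the convention that rows index the order and columns index the position are all assumptions made for illustration only.

```python
import numpy as np

def dct_basis(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis (rows = orders). Stand-in for an STKP matrix (assumption)."""
    x = np.arange(n)
    k = x.reshape(-1, 1)
    basis = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0, :] /= np.sqrt(2.0)
    return basis

def moments(image: np.ndarray, z1: np.ndarray, z2: np.ndarray) -> np.ndarray:
    """Separable moments in matrix form: M = Z1 I Z2^T."""
    return z1 @ image @ z2.T

def reconstruct(m: np.ndarray, z1: np.ndarray, z2: np.ndarray) -> np.ndarray:
    """Inverse transform for orthonormal bases: I = Z1^T M Z2."""
    return z1.T @ m @ z2

rng = np.random.default_rng(0)
image = rng.random((8, 12))                  # toy "frame" of size N1 x N2
z1, z2 = dct_basis(8), dct_basis(12)

full = moments(image, z1, z2)                # all N1 x N2 moments
restored = reconstruct(full, z1, z2)         # equals the input up to rounding

ord1, ord2 = 4, 6                            # keep only a subset of rows (orders)
low = moments(image, z1[:ord1], z2[:ord2])   # reduced feature matrix, ord1 x ord2
```

Keeping only `ord1` and `ord2` rows of the basis matrices shrinks both the matrix products and the resulting feature matrix, which is the source of the computational saving described above. Taking the first rows is a simplification; the row priority used for STKP is described in Section V-A.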

IV. PROPOSED FAST VIDEO PROCESSING METHOD
The framework of the proposed algorithm includes three steps: feature extraction, DS construction, and CTR identification. Preliminary processing is performed to decode the video frames. Then, only the active area of the frames is considered to preserve the valuable visual information and reduce computation time.
In the proposed algorithm, STKP is used to extract features. Two types of features are employed in the proposed algorithm which are: global and local features. The global features are used to specify the candidate transition locations in the process of candidate segments selection. The local features (moments) are considered for visual representation because they are robust against disturbance factors. Therefore, an OP block processing [8] module is employed to reduce the effect of video sequence disturbance as well as the computational cost.
Thereafter, the DS is constructed to form a feature vector that captures the dissimilarity of the contextual (temporal) information to improve the detection accuracy. The resulting feature vector is used in the next stage to identify the CTR.
Finally, the CTRs are detected in the identification stage, which uses a binary SVM model to classify the video sequence into transition (CTR) and non-transition frames. The general framework of the proposed algorithm is shown in Figure 1. In this section, the details of the proposed algorithm are presented. First, the active area of the frames is used to reduce the processed video frame area. Second, the proposed pre-processing module is applied to select the candidate segments.

A. MODIFIED FRAME ACTIVE AREA SELECTION
The active area selection is used to select the region of the frame that holds most of the visual information. The visual material in the frames can be divided into three types: persistent, variable, and valuable material. The persistent material usually appears at the upper and lower ends of the frames. Fixed-logo, fixed-subtitle, and fixed-intensity regions are examples of persistent material. These persistent visual materials affect the similarity measurements because they remain similar across different shots, as shown in Figure 2(a). The variable material has a significant effect on the dissimilarity measurements. Animated logos, animated subtitles, and transcripts are examples of variable material, as shown in Figure 2(b). The last type, valuable material, contains the rich features and is considered the active part of the frame, as shown in Figure 2(c). Therefore, the method presented in [7] is modified and used to select the frame active area by removing a portion of the persistent and variable visual material.
For each video frame of size N_1 × N_2, the frame active area is selected so that the resulting frame is of size N_A1 × N_A2, where prc_1 and prc_2 are the selected percentages for the frame active area regions. This selection ensures that the features are extracted from more reliable regions and that the computation is reduced by reducing the frame size, which increases the video processing speed.
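As a minimal sketch of the active area selection, the following crop keeps a fraction prc_1 of the frame height and prc_2 of the frame width. The default ratios 2/3 and 7/8 are the ones stated in the contributions; the function name, the rounding, and in particular the choice to center the retained region are assumptions, since the exact placement of the active area is not specified here.

```python
import numpy as np

def active_area(frame: np.ndarray, prc1: float = 2 / 3, prc2: float = 7 / 8) -> np.ndarray:
    """Crop a frame to N_A1 x N_A2 with N_A1 ~ prc1 * N1 and N_A2 ~ prc2 * N2.

    Centering the crop is an assumption made for this sketch.
    """
    n1, n2 = frame.shape[:2]
    na1 = int(round(prc1 * n1))
    na2 = int(round(prc2 * n2))
    top = (n1 - na1) // 2
    left = (n2 - na2) // 2
    return frame[top:top + na1, left:left + na2]

frame = np.zeros((240, 352))      # toy grayscale frame, N1 = 240, N2 = 352
cropped = active_area(frame)      # shape (160, 308)
```

With these ratios the cropped frame holds 160 × 308 = 49,280 of the original 84,480 pixels, roughly 58%, so every later moment computation operates on a smaller matrix.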

B. DEVELOPED CANDIDATE SEGMENT SELECTION TECHNIQUE
The primary step used to provide a low computation cost is the selection of the boundary transition and the elimination of the non-boundary transitions [22]. This selection will prevent a frame-by-frame linear scan.
Given that the frames within short temporal segments have a high similarity, the first and last frames of each segment are checked by measuring their similarity. If they are similar, the segment is marked as a non-boundary segment. If the measurement shows they are dissimilar, the segment candidate is marked as a transition segment. The non-boundary transitions require no further processing, thereby considerably reducing processing time.
The method proposed in [7] is developed to exclude additional non-transition frames. In the proposed method, the video frames with an active area are transformed into the moment domain using squared Tchebichef-Krawtchouk transform (STKT). Then, the global moment is computed using Equation (6) to form the features. The high-energy moment coefficients (features) are selected to describe each frame, as shown in Section V-A. This selection will also reduce computation complexity.
A pre-processing module based on an adaptive threshold, inequality criteria, and bisection-based comparison is designed to filter out the non-transition frames. Then, the candidate segment location is obtained and used in the next sub-stages. The pre-processing module is described as follows.

1) ADAPTIVE THRESHOLD
The adaptive threshold is used to filter out the non-boundary video segments. The first step is to partition the video into segments with a certain number of skipped frames (S_f) per segment. Between every two consecutive segments, one overlap frame provides temporal continuity [44]. Therefore, the segment length is equal to S_f + 1. Then, the distance D(k) between the first and last frames of the kth segment is measured from their moments, where η(f_i) is the moment of the ith frame. Every ten segments are grouped, and the local threshold of these segments is computed. The local measurement is computed for each group, and the global measurement is computed for all the segments. The local and global statistics (mean and standard deviation) are used to adaptively calculate the threshold, where mg is the global mean computed from all segment distances in the video, and ml and sl are the local mean and local standard deviation, respectively. When the distance of a segment is greater than the threshold, the segment is classified as a boundary segment. For these boundary segments, no additional comparisons are needed.
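The segmentation and the statistics that feed the adaptive threshold can be sketched as follows. This is an illustration under stated assumptions: the city-block distance between the first and last frame features is assumed for D(k), and the threshold formula itself is passed in as a caller-supplied function (`threshold_fn`, a hypothetical name) because its exact form is not reproduced in this sketch. The placeholder rule used in the example, ml + sl, is not the threshold of the proposed method.

```python
import numpy as np

def segment_distances(features: np.ndarray, sf: int) -> np.ndarray:
    """D(k) for segments [k*sf, (k+1)*sf]; consecutive segments share one frame.

    features has shape (num_frames, dim). City-block distance is an assumption.
    """
    starts = np.arange(0, features.shape[0] - sf, sf)
    return np.abs(features[starts] - features[starts + sf]).sum(axis=1)

def group_statistics(d: np.ndarray, group: int = 10):
    """Global mean mg, and per-segment local mean ml and local std sl (groups of ten)."""
    mg = d.mean()
    ml = np.empty_like(d)
    sl = np.empty_like(d)
    for s in range(0, len(d), group):
        chunk = d[s:s + group]
        ml[s:s + group] = chunk.mean()
        sl[s:s + group] = chunk.std()
    return mg, ml, sl

def boundary_mask(d: np.ndarray, threshold_fn, group: int = 10) -> np.ndarray:
    """True where the segment distance exceeds the adaptive threshold."""
    mg, ml, sl = group_statistics(d, group)
    return d > threshold_fn(mg, ml, sl)

# Toy video: 281 frames, 4 features per frame, one cut at frame 100.
features = np.zeros((281, 4))
features[100:] = 1.0
d = segment_distances(features, sf=14)                        # 20 segments of length 15
mask = boundary_mask(d, lambda mg, ml, sl: ml + sl)           # placeholder threshold rule
```

Only the segment that spans frames 98 to 112 is flagged in this toy example, so 15 of the 281 frames would be passed on to later stages.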

2) INEQUALITY NESTED CRITERIA
False positives among the detected boundary segments are preferable to false negatives (missed segments) [22]. Therefore, further inequality criteria are applied to the non-boundary segments because discarded segments cannot be retrieved. These criteria relate each segment to its neighbors and are defined as

$$D(k) > 5\,D(k-1) \quad \text{or} \quad D(k) > 5\,D(k+1) \quad \text{or} \quad D(k) > 1.5\,mg$$

When the distance of a segment previously classified as non-boundary satisfies the criteria, the segment is re-classified as a boundary segment.
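The re-classification step can be sketched as below. The constants 5 and 1.5 are taken from the algorithm listing given with the experimental results; the function name and the treatment of the first and last segments (a missing neighbor never triggers the criterion) are assumptions of this sketch.

```python
import numpy as np

def recheck_non_boundary(d, boundary, mg, ratio: float = 5.0, global_factor: float = 1.5):
    """Promote non-boundary segments that stand out from their neighbors.

    A segment is promoted when D(k) > ratio * D(k-1), or D(k) > ratio * D(k+1),
    or D(k) > global_factor * mg. Missing neighbors are treated as infinite
    so that they never satisfy the comparison (assumption).
    """
    d = np.asarray(d, dtype=float)
    prev = np.concatenate(([np.inf], d[:-1]))
    nxt = np.concatenate((d[1:], [np.inf]))
    promoted = (d > ratio * prev) | (d > ratio * nxt) | (d > global_factor * mg)
    return np.asarray(boundary, dtype=bool) | promoted

d = np.array([1.0, 1.0, 6.0, 1.0, 1.0, 1.2])
boundary = np.zeros(6, dtype=bool)            # nothing passed the adaptive threshold
result = recheck_non_boundary(d, boundary, mg=d.mean())
```

Here only the third segment is promoted, because its distance exceeds five times that of either neighbor.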

3) BISECTION-BASED COMPARISONS
For further elimination of the non-boundary frames in each candidate segment, two rounds of bisection comparisons are performed. Given that the cut transition (CTR) may occur within 1 or 2 frames, the scope of the CTR search must be precisely identified. Therefore, the first round of the bisection comparisons divides the segment frames into two sub-segments. The forward distance (D_f) is measured between the middle frame and the first frame, whereas the backward distance (D_b) is computed between the middle frame and the last frame. Then, the relationship between the three distances obtained in (9), (11), and (12) is used to classify the candidate segment into one of four types (13). The conditions for Type 1 and Type 2 compare D_f and D_b with each other and with the segment distance; Type 3 holds when D_f/D(k + 1) < 0.55 and D_b/D(k + 1) < 0.55; and Type 4 holds elsewhere. The types are interpreted as follows:
Type 1: the forward distance is large compared to the segment distance. In addition, the difference between the forward and backward distances is distinct. Therefore, the shot transition is in the first sub-segment frames only.
Type 2: the backward distance is large compared to the segment distance. The difference between the forward and backward distances is also distinct. Therefore, the shot transition is in the second sub-segment frames only.
Type 3: no transition occurs in the candidate segment, and the transition was incorrectly identified. Thus, the segment is classified as non-boundary.
Type 4: the entire segment (of length S_f + 1) is preserved because the segment may contain a shot transition.
The second bisection round is applied to every sub-segment of length (S_f/2 + 1) to obtain two segments with a length of (S_f/4 + 1), and (13) is repeated. For Type 4 in the second round, the segment is also preserved with a segment length of (S_f + 1). The bisection with two rounds is shown in Figure 3.
By eliminating a large number of non-boundary frames, the computational cost is reduced. In addition, each obtained candidate segment is suspected of containing a CTR; because its length is only a small number of frames, an accurate CTR position can be obtained. Figure 4 demonstrates the procedure for the candidate segment selection.
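The first bisection round can be partly illustrated with a short sketch that computes the forward and backward distances and applies the Type 3 test, the only condition whose bound (0.55) is quoted in the text. The Type 1 and Type 2 conditions are deliberately left out. The function names, the city-block distance, the use of the distance between the first and last frames of the segment as the denominator, and the integer midpoint are assumptions of this sketch.

```python
import numpy as np

TYPE3_RATIO = 0.55  # bound quoted in the text for the Type 3 condition

def city_block(a, b) -> float:
    """City-block (L1) distance between two feature vectors (assumed metric)."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def first_round(features: np.ndarray, start: int, sf: int):
    """Return (D_f, D_b, is_type3) for the segment [start, start + sf]."""
    first, middle, last = start, start + sf // 2, start + sf
    d_seg = city_block(features[first], features[last])
    if d_seg <= 0.0:
        raise ValueError("candidate segments are expected to have a positive distance")
    d_f = city_block(features[middle], features[first])   # middle frame vs. first frame
    d_b = city_block(features[middle], features[last])    # middle frame vs. last frame
    is_type3 = (d_f / d_seg < TYPE3_RATIO) and (d_b / d_seg < TYPE3_RATIO)
    return d_f, d_b, is_type3

# Slow drift across the segment: both halves contribute equally, so no cut is indicated.
ramp = np.arange(15, dtype=float).reshape(-1, 1)
drift_result = first_round(ramp, start=0, sf=14)

# Abrupt change at frame 3: the whole distance sits in the first half.
cut = np.zeros((15, 1))
cut[3:] = 1.0
cut_result = first_round(cut, start=0, sf=14)
```

In the drifting example both ratios equal 0.5, so the segment would be discarded as Type 3, whereas the abrupt change leaves the segment as a candidate for the remaining type tests.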

C. SBD WITH THE FAST-PROPOSED METHOD
The SBD framework includes three steps: feature extraction, DS construction, and CTR identification. Furthermore, a preliminary processing step is required to decode the video frames and perform color space conversion. Then, only the active area of the frames is considered in the subsequent stages to preserve the valuable visual information and provide the first step for computation time reduction. Therefore, the proposed frame active area is considered a sub-stage in the preliminary processing step.

1) FEATURE EXTRACTION OF SBD
Feature extraction plays a significant role in SBD because the content of the video frames is described by their features [18]. Therefore, in SBD, global and local features are extracted. The global features are extracted and used to find the location of the candidate segments. Thereafter, the local features of the candidate segments are considered to form the feature vector. The local features provide a more robust visual representation than global features. Moreover, the local features resist object motion, camera motion, and flashlight effects, thereby improving detection accuracy [8]. Consequently, the mathematical model of moment block processing (MBP) proposed in [8] is adopted for direct local feature extraction with reduced computation time. In this mathematical model, one set of matrix multiplications is required to obtain the block processing result without requiring the video frame to be partitioned. The mathematical expression of MBP is defined in [8] as Equation (14), where I is an image of size N_A1 × N_A2 with a block size of B_1 × B_2, and the number of blocks is equal to v_1 × v_2, where v_1 = N_A1/B_1 and v_2 = N_A2/B_2. P_B1 and P_B2 form a single set of matrices for the whole image. The two matrices are constructed from Z_B1 of size ord_1 × B_1 and Z_B2 of size ord_2 × B_2, respectively, by first performing a horizontal concatenation with a zero matrix and a circular shift (v − 1) times, and then performing a vertical concatenation. In this study, the number of blocks is selected empirically as v_1 = v_2 = 8 to compute the local features.
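The construction described above (zero padding, circular shifts, and vertical stacking) yields a block-diagonal matrix, which can be sketched with a Kronecker product. The sketch assumes that the block result is obtained as P_B1 I P_B2^T, which is consistent with the stated matrix sizes but is not quoted from [8]; the random orthonormal rows stand in for the STKP block matrices, and all function names are hypothetical.

```python
import numpy as np

def block_matrix(z_b: np.ndarray, v: int) -> np.ndarray:
    """Block-diagonal matrix with v copies of z_b (size v*ord x v*B)."""
    return np.kron(np.eye(v), z_b)

def local_moments(image: np.ndarray, z_b1: np.ndarray, z_b2: np.ndarray, v1: int, v2: int) -> np.ndarray:
    """All block moments in one pair of products (assumed form: P_B1 I P_B2^T)."""
    return block_matrix(z_b1, v1) @ image @ block_matrix(z_b2, v2).T

def orthonormal_rows(size: int, order: int, rng) -> np.ndarray:
    """Random orthonormal rows; a stand-in for an STKP block matrix (assumption)."""
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    return q[:order]

rng = np.random.default_rng(0)
v1 = v2 = 8                                   # number of blocks, as selected in this study
b1, b2, ord1, ord2 = 4, 5, 2, 3               # toy block sizes and orders
image = rng.random((v1 * b1, v2 * b2))
z_b1 = orthonormal_rows(b1, ord1, rng)
z_b2 = orthonormal_rows(b2, ord2, rng)

fast = local_moments(image, z_b1, z_b2, v1, v2)

# Reference: partition the image explicitly and transform block by block.
slow = np.zeros_like(fast)
for i in range(v1):
    for j in range(v2):
        block = image[i * b1:(i + 1) * b1, j * b2:(j + 1) * b2]
        slow[i * ord1:(i + 1) * ord1, j * ord2:(j + 1) * ord2] = z_b1 @ block @ z_b2.T
```

The two results agree, which shows why no explicit partitioning of the frame is needed: the block structure lives entirely in the two precomputed matrices.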

2) DISSIMILARITY MEASUREMENTS OF SBD
After obtaining the local features, the next stage in SBD is to find the DS between the computed moments of two consecutive frames (f_k and f_(k+1)). The DS is computed using the city-block distance metric, and the dissimilarity feature vectors (DFV) are formed for each of the candidate frames as given in Equation (15), where loc_candidate represents the location of the candidate frames in the candidate segments, and ord_B is equal to ord × v.
To improve detection accuracy, contextual information is considered [6]. This temporal information represents the features of the previous (Pre) and posterior (Pos) candidate frames. The resultant feature vector is used in the next stage to identify the CTR. However, the dynamic range of the resulting features is a problem in detection. Therefore, feature vector normalization is important to keep the features within a similar range [41], [45]. The mapping processes matrices by transforming the mean (x_mean) and standard deviation (x_std) of each row to the desired mean (y_mean) and desired standard deviation (y_std) of the kth feature vector (FV_c) as follows:

$$FV_n = \left(FV_c - x_{mean}\right) \frac{y_{std}}{x_{std}} + y_{mean} \tag{16}$$

In this study, the feature vector is normalized with the desired mean and desired standard deviation equal to 0 and 1.2, respectively. This feature vector is used in the identification stage.
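The dissimilarity and normalization steps can be illustrated with the following sketch. The city-block distance and the target mean of 0 and standard deviation of 1.2 come from the text; the function names, the use of the sample standard deviation (ddof = 1), and the arrangement of the contextual features as [Pre, current, Pos] are assumptions made for illustration.

```python
import numpy as np

def dissimilarity_signal(local_moments: np.ndarray) -> np.ndarray:
    """City-block distance between the moments of consecutive frames.

    local_moments has shape (num_frames, rows, cols).
    """
    m = np.asarray(local_moments, dtype=float)
    return np.abs(m[1:] - m[:-1]).reshape(m.shape[0] - 1, -1).sum(axis=1)

def map_std(x: np.ndarray, y_mean: float = 0.0, y_std: float = 1.2) -> np.ndarray:
    """Map each row to the desired mean and standard deviation.

    Rows with zero spread would divide by zero and need separate handling.
    """
    x = np.asarray(x, dtype=float)
    x_mean = x.mean(axis=1, keepdims=True)
    x_std = x.std(axis=1, ddof=1, keepdims=True)   # sample std (assumption)
    return (x - x_mean) * (y_std / x_std) + y_mean

rng = np.random.default_rng(1)
frame_moments = rng.random((6, 16, 16))            # toy local moments for 6 frames
ds = dissimilarity_signal(frame_moments)           # 5 dissimilarity values

# Hypothetical contextual layout: one row per candidate, columns [Pre, current, Pos].
context = np.stack([ds[:-2], ds[1:-1], ds[2:]], axis=1)
normalized = map_std(context)
```

After the mapping every row has the same mean and spread, so the classifier sees features in a comparable range regardless of how active the underlying video content is.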

3) CTR DETECTION OF SBD
For the classification task, the support vector machine (SVM), a popular supervised machine learning technique, is implemented [45], [46]. SVM has significant advantages and is easier to implement than certain classifiers, as follows. First, compared with K-nearest neighbor (KNN), SVM provides better classification results in noise-free and noisy environments [38]. Second, compared with a neural network, SVM solves the local convergence problem [47]. Finally, compared with classical deep learning, SVM requires fewer parameters to be initialized and a smaller number of training samples; therefore, less training time is required [48], [49]. Moreover, deep learning imposes many more architectural requirements; hence, it needs much more computational power and resources than SVM does [49], [50].
In SVM, data are usually divided into training and testing sets in the classification task. The SVM generates a model from the training set, in which each case contains several attributes (features) and a label (target value). This model is then used to predict the test labels from the testing features only [45]. To improve performance, the radial basis function (RBF) kernel is used with SVM. RBF is a suitable choice because it provides non-linear mapping [45]. Moreover, it reduces the complexity by reducing the number of generated hyperplanes [45]. X-fold cross-validation is applied to tune the SVM cost parameter and the RBF gamma parameter and to overcome the over-fitting problem. SVM is implemented by using the LIBSVM package [45], [51]. Figure 5 shows the block diagram of the SBD with the fast video processing method.
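The training and tuning procedure can be sketched with scikit-learn, whose `SVC` class is built on LIBSVM. This is not the MATLAB implementation used in this work: the synthetic two-class data, the parameter grid, the train/test split, and the choice of five folds are all assumptions made so that the example is self-contained.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for normalized feature vectors: label 1 = CTR, label 0 = non-transition.
rng = np.random.default_rng(0)
non_transition = rng.normal(0.0, 0.5, size=(200, 3))
transition = rng.normal(3.0, 0.5, size=(40, 3))
x = np.vstack([non_transition, transition])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(40, dtype=int)])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0
)

# Cross-validated grid search over the cost parameter C and the RBF gamma.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
search.fit(x_train, y_train)

accuracy = search.score(x_test, y_test)   # accuracy of the best model on held-out data
best = search.best_params_                # selected C and gamma
```

Tuning C and gamma on held-out folds rather than on the training fit itself is what guards against over-fitting; the test split is touched only once, after the parameters are fixed.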

V. EXPERIMENTAL RESULTS
This section discusses the performance of the proposed method in terms of speed and transition accuracy. Subsequently, a comparative analysis is performed to demonstrate the efficiency of the proposed method. The performance of the proposed fast and accurate method is evaluated by using a well-known dataset. TREC Video Retrieval Evaluation (TRECVID) 2001, 2005, 2006, and 2007 test data, which is co-sponsored by the National Institute of Standards and Technology, are used [52]. The test sets comprise many CTRs and GTRs, and the video sequences are reformed into the uncompressed AVI format. In this study, CTR is used to test the ability of the proposed method because it is more predominant than GTR [6], [53]. Table 1 shows the details of the TRECVID datasets. The table shows that the number of non-boundary frames is very large compared to the transition frames (CTR). Therefore, the elimination of these frames will highly reduce computation. The experimental results are performed using MATLAB on a Windows 10 PC with an Intel Core i7 2.4 GHz CPU and 16 GB of RAM.

    …                                              ▷ to extract global feature
    Candidate_Segments_Selection(f_A, S_f)         ▷ process all frames to extract local feature
    Apply OP block processing
    {η} ← moment extraction                        ▷ smooth & gradient moments
    {D} ← Dissimilarity(f_k, f_k+1)
    FV_n ← (contextual information and feature normalization)
    Cut_Transition_Detection(FV_n, N_gt)
end procedure

function Candidate_Segments_Selection(f_A, S_f)
    segment_length = S_f + 1                       ▷ segment video frames
    for i = 1 to 10 do
        Group(segment(i))
    end for
    TH = 2ml(k) + 1.5(mg/sl(k)(ml(k)))             ▷ adaptive threshold
    if D(segment) > TH then
        candidate segment
    else
        if (D(k) > 5D(k − 1)) or (D(k) > 5D(k + 1)) or (D(k) > (1.5mg)) then
            candidate segment
        end if
    end if
    if (candidate segment) then
        Apply two-round of bisection
        return Loc_candidate
    else
        Discard the segment
    end if
end function

function Cut_Transition_Detection(FV_n, N_gt)
    predict_label ← svmpredict(FV_n, model)
    if (predict_label = N_gt) then
        Declare CTR
        return CTR
    end if
end function

A. EVALUATION OF FRAME ACTIVE AREA SELECTION
For each video frame of size N 1 ×N 2 , three cases of the frame active area selection are given in Table 2, and the resultant frame has the size N A1 × N A2 .
For the frame active area experiment, we chose the TRECVID 2005 dataset, which consists of twelve videos. Three cases for frame active areas, shown in Figure 6, are tested with the highest energy coefficients. For STKP, to select the high-energy coefficients, we implement the following steps.
1) We compute STKP using Equation (3) for each frame dimension (N_1 and N_2). The results are two polynomial matrices of size N_1 × N_1 and N_2 × N_2, respectively.
2) For STKP, the priority of the selection order (in the n-direction) is n = 0, N − 1, 1, N − 2, ..., N/2 + ord/2 − 1, N/2 − ord/2. Therefore, we select the coefficients row-wise for each generated polynomial such that the number of selected coefficients equals the required order.
Figure 7 shows the highest energy coefficient generation. The selection of these coefficients reduces the computation cost because the entire set of moment coefficients is not needed in the subsequent computations. The three cases for frame active area are tested with 6%, 12%, and 25% of the high-energy moment coefficients. The execution time is computed for each case and repeated three times for accuracy appraisal. Table 3 shows the average execution time for the frame active area cases. From Table 3, it can be inferred that as the size of the frames increases, the time required for processing increases with comparable visual material, as shown in Figure 6. The results show the superiority of Case2 in terms of average time for the different percentages of moment coefficients (6%, 12%, and 25%). In addition, when the selected moment order is increased, the computational cost also increases because the size of the OP matrix increases in the n-direction. From the results, the lowest total execution time for Case2 is 1125.11 seconds, obtained when 6% of the moment coefficients are selected. On the basis of the obtained results, Case2 with a moment order of only 6% is considered in the next experiments.

B. EVALUATION OF CANDIDATE SEGMENT SELECTION
The parameters have a considerable effect on candidate segment selection and detection performance. Therefore, the threshold criteria and the bisection factors must be chosen to suit complex video content. Decreasing the threshold parameters causes segments with lower distance to be considered candidate segments. The number of frames in each segment, $S_f$, is selected to balance detection accuracy and speed. A small $S_f$ increases the number of distance calculations, thereby increasing the computational complexity. By contrast, large values reduce the accuracy by incorrectly considering a non-boundary segment as a boundary segment. To specify the best $S_f$, a matrix is formed with ones at the positions of the frames in the candidate segments and zeros elsewhere. Then, the matrix is compared with the CTRs in the ground truth. The comparison uses two measures: the frame percentage (FRP) and the transition accuracy percentage (TAP). Lower FRP values with higher TAP values are desired. FRP is the percentage of frames that will be processed and is defined as follows:
\[ \mathrm{FRP} = \frac{F_{seg}}{N_f} \times 100\% \]
where $F_{seg}$ is the total number of frames in the candidate segments and $N_f$ is the total number of frames in the video sequence. TAP is the percentage of correctly predicted transitions and is computed as follows:
\[ \mathrm{TAP} = \frac{T_{correct}}{T_{groundtruth}} \times 100\% \]
where $T_{correct}$ is the number of correctly predicted transitions and $T_{groundtruth}$ is the total number of transitions in the ground truth. Different values of $S_f$ are applied to the TRECVID 2005 dataset, and the values of TAP and FRP are computed and recorded for each video, as shown in Table 4. Note that only the adaptive threshold and the inequality criteria are used to select the best number of skipped frames. The results show that for a small segment length, $S_f$ = 10, the processing time increases because the number of segments increases and, thus, so does the number of distance calculations.
On the other hand, for large values of skipped frames, $S_f$ = 20, although the computation is reduced, some CTRs are not recognized and are judged as non-boundary. Hence, the TAP is reduced, and among the cases, the FRP shows the highest value of 23.92. Notably, $S_f$ = 14 provides a trade-off between FRP and TAP. Therefore, $S_f$ = 14 is considered the suitable selection because it affects both the total number of processed frames and the accuracy of transition detection. Moreover, for further clarification, the performance of the candidate segment selection technique is evaluated using 48 videos from the TRECVID dataset. Table 5 shows the FRP and TAP results of the proposed method. The results are computed first for the first two steps of candidate segment selection (the adaptive threshold and the inequality criteria). Then, the bisection-based method is applied. Without the bisection comparison, 77% of the total frames are eliminated. In other words, only 23% of the frames (475,485 frames out of 2,077,260) are processed in the next stages. These frames are further reduced when the bisection is applied, and the percentage of frames to be processed is approximately 17% (only 353,134 of the 2,077,260 frames need to be processed), as reported in Table 1.
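The FRP and TAP measures described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the inclusive segment boundaries, the example numbers, and the rule that a cut counts as preserved only when both frames around it lie inside a candidate segment are all assumptions made for the example.

```python
def candidate_mask(segments, n_frames):
    """Binary mask with 1 at frames inside any candidate segment, 0 elsewhere."""
    mask = [0] * n_frames
    for start, end in segments:                      # inclusive indices (assumed convention)
        for i in range(max(start, 0), min(end, n_frames - 1) + 1):
            mask[i] = 1
    return mask

def frame_percentage(mask):
    """FRP: share of frames that remain to be processed, in percent."""
    return 100.0 * sum(mask) / len(mask)

def transition_accuracy_percentage(mask, cut_frames):
    """TAP: share of ground-truth cuts kept by the candidate segments, in percent.

    A cut is given by the index t of the first frame of the new shot and is
    counted as kept when frames t-1 and t are both in the mask (assumption).
    """
    if not cut_frames:
        return 0.0
    kept = sum(1 for t in cut_frames if t >= 1 and mask[t - 1] and mask[t])
    return 100.0 * kept / len(cut_frames)

# Hypothetical 100-frame video with two 14-frame candidate segments and three cuts.
mask = candidate_mask([(10, 23), (50, 63)], 100)
cuts = [15, 55, 80]
print(frame_percentage(mask))                        # 28 of 100 frames are kept
print(transition_accuracy_percentage(mask, cuts))    # 2 of 3 cuts are kept
```

Applied to the totals quoted above, the same ratio gives 475,485 / 2,077,260, roughly 23%, before bisection and 353,134 / 2,077,260, roughly 17%, after it.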
In conclusion, the results show a high TAP when applying the first two steps. Moreover, the FRP improves when the bisection comparison is applied because further non-boundary segments are eliminated. To validate the superiority of the proposed scheme in terms of speed (FRP) and accuracy (TAP), the results of the proposed candidate segment selection technique are compared with two existing techniques: the pre-processing method [22] and the single-plane method [26]. The comparison is performed on the TRECVID 2001, 2005, 2006, and 2007 datasets in terms of FRP and TAP for all methods, as shown in Table 6. The method in [22] shows a lower FRP than the proposed method; however, its TAP is unsatisfactory because many transitions are missed in the candidate segments. By contrast, the method in [26] shows a higher FRP and a lower TAP than the proposed method. For example, on the TRECVID 2001 dataset, the proposed technique processes 5% fewer frames than the method in [26]. In other words, the frames are fewer by 5% × 97,808 (from Table 1) = 4,890 frames, and the TAP is higher. Therefore, the results show the superiority of the proposed method over the existing methods. For further illustration, the average results over all datasets, which contain different video genres, are reported in Table 7. In addition, the improvement of the proposed method over the existing methods is also reported. From Table 7, the achieved improvements in FRP and TAP are 1.63% and 2.05%, respectively, indicating that the proposed candidate segment selection technique reduces the number of processed frames by 1.63% and increases the number of CTRs retained in the selected candidates by 2.05%; that is, the TAP increases while the FRP decreases when the proposed technique is used.
Moreover, the proposed SBD method has several advantages over existing methods. First, the proposed SBD algorithm uses the active area of the frames in the candidate segments, which in turn reduces the computation cost. Second, fewer frames are skipped than in existing methods; in this case, the number of processed frames is decreased and the computation cost is reduced.

C. PERFORMANCE EVALUATION OF THE PROPOSED SBD ALGORITHM
For the SBD, the local features are extracted based on STKP with 6% of the moment order. Then, the dissimilarity feature vector is formed with contextual information. We experimentally found that pre = pos = 2 leads to a reasonable trade-off in the results. The normalized feature vector is used as input to the SVM classifier, which is trained on 50% of the videos; the remaining 50% of the videos in the datasets are used for testing. Notably, the proposed SBD algorithm overcomes the class imbalance problem. This problem arises from the imbalance between the two classes [51]: non-transition frames outnumber transition frames. By using the candidate segments with the bisection, many non-transition frames are excluded. Therefore, the transition and non-transition classes become more balanced. The performance of the SBD is assessed in terms of computation cost and CTR detection. The evaluation metrics are precision (P), recall (R), $F_{score}$, and computation time. These metrics can be defined as follows [7]:
\[ P = \frac{N_{cd}}{N_{td}}, \qquad R = \frac{N_{cd}}{N_{gt}}, \qquad F_{score} = \frac{2 \times P \times R}{P + R} \]
where $N_{cd}$, $N_{td}$, and $N_{gt}$ are the numbers of correctly detected, totally detected, and ground truth transitions, respectively. Note that the computation time is reported for all stages of the proposed SBD algorithm. Table 8 summarizes the accuracy and the computation time of the SBD algorithm. The results demonstrate that the CTR scheme provides good detection performance for P and R. A higher P indicates fewer falsely detected transitions, and a higher R indicates more correctly detected transitions. Good $F_{score}$ results, where $F_{score}$ is the harmonic mean of R and P, are obtained. Therefore, promising detection performance can be obtained regardless of the type of video dataset. For further clarification, the confusion matrices for the datasets used in the experiment are depicted in Figure 8.
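The three accuracy metrics can be computed directly from the counts of correctly detected, totally detected, and ground truth transitions. The following is a minimal sketch using the standard definitions of precision and recall and the harmonic mean for the F score; the function name, the zero-division handling, and the example counts are assumptions and are not taken from the reported experiments.

```python
def detection_scores(n_correct, n_detected, n_ground_truth):
    """Return (precision, recall, f_score) as fractions in [0, 1]."""
    # Precision: share of detected transitions that are correct.
    precision = n_correct / n_detected if n_detected else 0.0
    # Recall: share of ground-truth transitions that were detected.
    recall = n_correct / n_ground_truth if n_ground_truth else 0.0
    # F score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical counts: 90 correct out of 100 detections, 120 cuts in the ground truth.
p, r, f = detection_scores(90, 100, 120)
print(f"P = {p:.4f}, R = {r:.4f}, F = {f:.4f}")
```

Multiplying each value by 100 gives the percentages in which such results are usually tabulated.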
The robustness of the obtained results is verified by a comparison with the state-of-the-art SBD algorithm proposed in [7]. Table 9 shows a comparison of the accuracy measurements and computation time. The results indicate an excellent runtime for the proposed algorithm, with a 38.17% improvement over the SBD algorithm in [7]. This improvement is related to the large reduction in the number of processed frames. However, the proposed algorithm shows slightly lower accuracy than the SBD algorithm in [7] because of the number of features used. The proposed algorithm uses only one feature based on the moments of STKP, whereas the SBD algorithm in [7] uses three features based on embedding operators. Overall, by using the proposed algorithm, the computational cost is highly reduced with acceptable detection accuracies. Therefore, the proposed algorithm is superior to the state-of-the-art SBD algorithm in terms of computation time.
To demonstrate the efficiency of the proposed algorithm, additional comparisons are performed between the proposed algorithm and several existing algorithms: the concatenated block-based SBD algorithm (CBB-SBD) [54], the Walsh-Hadamard transform-based SBD algorithm (WHT-SBD) [11], the SBD algorithm based on convolutional neural networks (CNN-SBD) [28], and SBD using the Non-Subsampled Contourlet Transform (NSCT-SBD) [14]. Table 10 reports a comparison between the proposed algorithm and the existing algorithms using the TRECVID 2007 dataset. From Table 10, it can be seen that the highest $F_{score}$, 97.42%, is reported for the WHT-SBD algorithm, whereas the $F_{score}$ of the proposed algorithm is 96.97%, which is only 0.45% lower. However, regarding the computation time, the proposed algorithm offers a noticeable reduction compared with the existing methods. The computation time reported for the proposed algorithm is 66.01 seconds, which is approximately 11 times faster than the CNN-SBD algorithm.
Another comparison, between the proposed algorithm, the CBB-SBD algorithm, DeepSBD, and TSSBD using the TRECVID 2005 dataset, is reported in Table 11. From Table 11, it is clear that the CBB-SBD algorithm outperforms the proposed algorithm by only 0.26%. However, in terms of the computation time required to detect transitions, the proposed algorithm outperforms the CBB-SBD algorithm with an improvement of 190. In addition, the proposed algorithm outperforms the DeepSBD and TSSBD algorithms in terms of transition detection accuracy.
From the reported results, it can be observed that the proposed algorithm shows a remarkable improvement in the time required to detect transitions, with a slight decrease in accuracy. The proposed algorithm reduces the computation time remarkably (see Tables 9, 10, and 11) for three reasons: 1) the number of processed frames is reduced by the developed segment selection technique, 2) the size of each frame is reduced by the frame active area selection technique, and 3) a smaller number of moments (features) is used to represent each frame.

VI. CONCLUSION
In this study, a fast video processing method for SBD based on the frame active area and a candidate segment selection technique is proposed. The frame active area is used to preserve valuable visual information and provides the first step in computation time reduction. The candidate segment selection is implemented to reduce the computation in the successive stages by eliminating the non-boundary frames. To perform the elimination, an adaptive threshold, inequality criteria, and two rounds of bisection comparisons are implemented as a pre-processing module. As a result, the proposed method demonstrates superior performance in terms of video processing speed and accuracy. The SBD application is implemented with the proposed method to detect CTRs. STKP is used to extract the SBD features, and the transitions are detected using supervised machine learning. Compared with existing methods, the proposed method achieves a remarkable trade-off between TAP and FRP. Furthermore, an excellent reduction in computational cost is achieved. In future work, the proposed method will be examined with additional features to improve detection accuracy. In addition, we aim to generalize the proposed method to detect different types of shot transitions. Furthermore, different applications, especially in video processing and multimedia analytics, will be implemented based on the proposed method.