iCUS: Intelligent CU Size Selection for HEVC Inter Prediction

The hierarchical quadtree partitioning of Coding Tree Units (CTU) is one of the striking features in HEVC that contributes towards its superior coding performance over its predecessors. However, the brute force evaluation of the quadtree hierarchy using the Rate-Distortion (RD) optimisation, to determine the best partitioning structure for a given content, makes it one of the most time-consuming operations in HEVC encoding. In this context, this paper proposes an intelligent fast Coding Unit (CU) size selection algorithm to expedite the encoding process of HEVC inter-prediction. The proposed algorithm introduces (i) two CU split likelihood modelling and classification approaches using Support Vector Machines (SVM) and Bayesian probabilistic models, and (ii) a fast CU selection algorithm that makes use of both offline trained SVMs and online trained Bayesian probabilistic models. Finally, (iii) a computational complexity to coding efficiency trade-off mechanism is introduced to flexibly control the algorithm to suit different encoding requirements. The experimental results of the proposed algorithm demonstrate an average encoding time reduction performance of 53.46%, 61.15%, and 58.15% for Low Delay B, Random Access, and Low Delay P configurations, respectively, with Bjøntegaard Delta-Bit Rate (BD-BR) losses of 2.35%, 2.9%, and 2.35%, respectively, when evaluated across a wide range of content types and quality levels.

Recent advancements in multimedia technologies that span across Consumer Electronics (CE) in video content capturing, transmission and display have made video data the most frequently exchanged type of content over the modern communication networks. The increasing mobile consumption of High Definition (HD) and Ultra High Definition (UHD) video contents has contributed immensely towards the ever-growing IP video traffic and it is expected to reach over 82% of the overall Internet traffic in 2021 [1]. However, the estimated growth in network bandwidth (1.9 fold from 2017-2022, which is 39.0 Mbps to 75.4 Mbps for fixed broadband [1]) with time is still not sufficient to cater for the ever-growing user demands. Furthermore, the video requirements for emerging applications such as Augmented Reality (AR)/Virtual Reality (VR), interactive television, multi-party video conferences and over-the-top (OTT) multimedia consumption demand continuous improvements in the compression efficiency [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Victor Sanchez.
In this regard, High Efficiency Video Coding (HEVC), which was introduced in 2013, is the most recent stable video coding standard. It provides greater compression efficiency through an assortment of new features and coding tools over its predecessor H.264/AVC [3]. Out of these, the hierarchical quadtree partitioning structure introduced in HEVC, which entails a wide range of Coding Unit (CU) sizes (i.e., 8 × 8 to 64 × 64) and their combinations, is one of the important contributors towards HEVC's improved coding performance. However, at the same time, it is also a major source of the complexity within the HEVC architecture [4], [5]. In addition, the brute-force Rate-Distortion (RD) optimisation followed by the encoder to determine the best coding mode configuration and quadtree partitioning structure for a given content is considered a main reason for the increased encoding complexity. For example, a mere increase of maximum CU size from 16 × 16 to 64 × 64 results in an average encoding time increase of 43% [4]. Hence, it is often identified that the complexity of the coding tools introduced in HEVC would require significant improvements to make them operational in real-time [5]. This is further corroborated by the recent efforts of the Moving Picture Experts Group (MPEG) to standardise several approaches for low complexity video encoding, illustrating the importance of making the video coding algorithms less complex and less resource-intensive [6]. The recent literature identifies numerous mechanisms to reduce the HEVC encoding complexity by reducing the time taken by the RD optimisation to select the optimal coding modes and quadtree structure (in this case CU size) for a given video content [7].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The state-of-the-art fast encoding methods in this domain can be broadly categorised into two main approaches: rule-based methods and learning-based methods. Rule-based methods typically utilise the depth correlation of spatial and temporal blocks, RD cost statistics of previous CUs, and prediction modes [8], [9] to generate a fixed set of rules to determine the optimal CU size for a given content. However, vast differences in video characteristics and the dynamic nature of the video contents make it difficult to maintain a fixed set of rigid rules to determine the optimal CU sizes for particular video content. On the other hand, learning-based approaches often utilise machine learning methods with pre-trained models generated from vast amounts of data from previously encoded sequences [10]-[14]. Yet, the use of fixed rigid models makes them less adaptive to the changing properties (i.e., texture and motion characteristics) of natural video sequences. Algorithms that use online training [15], [16] facilitate generating content-adaptive dynamic models, but at the same time suffer from a lack of training data for certain features, making the decisions unreliable. Therefore, it is highly beneficial to investigate dynamic and flexible encoding algorithms that make use of the advantages of both offline and online trained prediction models to reduce the computational complexity of HEVC encoding while keeping the coding efficiency intact.
To this end, this paper proposes iCUS; an intelligent CU split decision making algorithm, that takes a hybrid approach by utilising offline-trained support vector machine (SVM) models together with Bayesian statistical models that track the probability of occurrence of features in the current video being encoded. The experimental results reveal that the proposed algorithm achieves a significant encoding time reduction performance compared to the HM16.8 [17] implementation and the state-of-the-art algorithms, with a minimal impact on the coding efficiency, for a variety of content types.
The remainder of this paper is organised as follows. Section I provides an overview of the HEVC block partitioning, CU size selection process, and the existing work in the literature on computational complexity reduction of HEVC inter-prediction. The CU split likelihood modelling approaches using SVMs and probabilistic models are discussed in Sections II and III, respectively. The proposed encoding complexity reduction algorithm that uses both SVM and probabilistic models is described in Section IV. Experimental results are discussed in Section V and, finally, Section VI concludes the paper and discusses the potential for future improvements.

I. BACKGROUND AND RELATED WORK
This section first provides an overview of the hierarchical quadtree partitioning employed in the block-based HEVC encoding architecture. Next, state-of-the-art methods which focus on reducing the resulting encoding complexity due to the brute-force RD optimisation are discussed.

A. BACKGROUND
HEVC utilises a block-based partitioning structure to determine the encoding parameters for a given content. In this case, a frame from a sequence is initially divided into blocks called Coding Tree Units (CTUs), which are the basic units of partitioning in HEVC. A CTU can have a maximum size of 64 × 64 and it can be recursively sub-divided into four equally-sized smaller blocks called Coding Units (CUs) that can have sizes varying from 64 × 64 to 8 × 8. As an example, Fig. 1 shows the partitioning structure for a frame from the Johnny video sequence, where the frame is divided into CTU and CU blocks. Fig. 2(a) depicts how a CTU can be recursively subdivided into CUs, and the corresponding quadtree structure of the CTU is depicted in Fig. 2(b). A CU can contain one or more Prediction Units (PUs) and Transform Units (TUs) that contain prediction and transform information, respectively [4]. In general, an HEVC compatible encoder goes through a process of evaluating all possible combinations of the quadtree structure to determine the coding parameter combination that achieves the best coding efficiency performance for a given content. In the HM [17] reference encoder, this is achieved by an RD optimisation process that selects the coding parameter combination with the minimum RD cost, computed by evaluating the Lagrangian optimisation cost function given by,
$$p^{*} = \underset{p \in A}{\arg\min} \left\{ D(p) + \lambda \cdot R(p) \right\} \tag{1}$$
where $\lambda \geq 0$ and $p^{*}$ denote the Lagrange multiplier and the optimal coding parameters from the set of all coding options ($A$) considered for the minimisation, respectively. The terms $D(p)$ and $R(p)$ denote the distortion and the rate associated with the set of coding parameters $p$, respectively [18]. This brute-force approach of evaluating all possible combinations of coding parameters when determining the optimal coding structure enables HEVC to achieve very high coding efficiency.
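The Lagrangian selection described above can be illustrated with a minimal sketch. It assumes each coding option is summarised by a single distortion value and a single rate value; the `Candidate` type, the option names, and all numbers are hypothetical and are not taken from the HM encoder or from the experiments in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One coding option p with its measured distortion D(p) and rate R(p)."""
    name: str          # hypothetical label, e.g. "no_split" or "split"
    distortion: float  # D(p)
    rate: float        # R(p), in bits


def rd_cost(candidate, lam):
    """Lagrangian cost J(p) = D(p) + lambda * R(p)."""
    return candidate.distortion + lam * candidate.rate


def best_candidate(candidates, lam):
    """Return the option in the candidate set A with the minimum RD cost."""
    if lam < 0:
        raise ValueError("the Lagrange multiplier must be non-negative")
    return min(candidates, key=lambda c: rd_cost(c, lam))


# Illustrative numbers only, not measurements from the paper.
options = [
    Candidate("no_split", distortion=120.0, rate=10.0),
    Candidate("split", distortion=80.0, rate=30.0),
]
low_lambda_choice = best_candidate(options, lam=1.0)
high_lambda_choice = best_candidate(options, lam=4.0)
```

With a small multiplier the lower-distortion option wins, whereas a larger multiplier penalises rate more heavily and favours the cheaper option. The brute-force cost arises because a real encoder must encode every candidate to obtain its distortion and rate before this comparison can be made.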
However, it adds an immense computational complexity burden to the encoder, resulting in excessive encoding time and energy consumption. As a result, numerous algorithms are proposed in the recent literature to reduce the encoding time complexity of HEVC while keeping the coding efficiency unscathed. Detailed encoder profiling results suggest that a large proportion of the overall encoding time is spent at the CU level to determine the best CU structure for a given content [5]. Hence, the following subsection analyses some of these fast encoding algorithms that are aimed at minimising the encoding time spent at the CU size selection stage within the encoding pipeline.

B. RELATED WORK
In HEVC, inter-prediction accounts for the highest portion of the encoding time [19]. The encoding time complexity in HEVC inter-prediction can be reduced by introducing changes to a range of operations within the HEVC architecture. For example, optimisation attempts in motion estimation, in-loop filtering, and coding structure selection processes are reported in the recent literature. The complexity profiling results in [5] and recent literature report that the encoding time gain that can be achieved from the optimisation of motion estimation and loop filtering processes is minimal. This is further corroborated by the experimental results presented in [20] and [21] for their fast motion estimation methods that report ≈ 20% encoding time reductions. Similarly, fast Sample Adaptive Offset (SAO) parameter estimation algorithms proposed in [22] and [23] are limited to average encoding time gains that are in the range of 42% − 45%.
On the other hand, recent literature reports that expediting the HEVC quadtree size selection process can lead to much higher overall encoding time reductions. In addition, such fast motion estimation and in-loop filtering algorithms are not mutually exclusive and can be utilised in conjunction with fast coding structure selection algorithms, resulting in much higher overall encoding time reductions. The algorithm proposed in this paper falls into the category of fast CU size selection methods, thus, the following subsection mainly discusses the state-of-the-art work that focuses on reducing the encoding time complexity of the CU size selection process within the HEVC architecture. The state-of-the-art work that targets fast CU size selection can be broadly categorised into two areas; rule-based approaches and learning-based approaches. The following sub-sections provide summaries of the related work that fall into each of these two categories.

1) RULE-BASED METHODS
In general, rule-based methods utilise statistical inferences based on the video content and data gathered during the encoding process to generate a fixed set of rules to determine the best CU size for a given content. In this context, the use of features extracted from the encoding loop such as RD cost, Skip mode, and Merge mode, is one of the popular approaches in the recent literature. For instance, Lee et al. [24] use previous RD cost details of the inter 2N × 2N PU mode together with Skip and Merge modes to estimate the CU size decision of the current CU. The algorithm, however, demonstrates a considerable variance in the encoding time reduction with the Quantisation Parameter (QP) and content type. For example, encoding complexity reduction achievable is minimal with highly textured and complex sequences, especially at lower QPs, where Skip mode is less significant. Similar behaviour is observed in the algorithm proposed by Vanne et al. [25], which skips the evaluation of certain PU modes depending on the Skip mode decisions at the current CU depth level. The algorithm proposed by Choi and Jang [26] follows a similar approach, but achieves a limited encoding time reduction compared to similar approaches.
HEVC reference encoder implementation (i.e., HM Test Model 16.8 [17]) comes with built-in fast encoding options that also utilise encoding information extracted within the encoding loop. These include Early CU (ECU) termination [27], Early Skip Detection (ESD) [28], and Coding Flag Mode (CFM) [29]. ECU approach terminates further recursive splitting of a CU, if the best mode for the current CU depth is determined to be the Skip mode. In the ESD method, Skip mode detection is further expedited if the motion vector difference and coded block flag of Inter 2N × 2N are identified as zeros. In CFM, when the coding block flag of the current PU is zero, the subsequent PU mode evaluations are bypassed. Deficiencies arising from the rigidness of the decision rules are visible in these algorithms when encoding highly complex video sequences.
Characteristics of the content being encoded are also often used as features in the algorithms that estimate the CU size and coding structures for a given video content prior to the encoding process. For instance, Pyramid Motion Divergence (PMD) features calculated from the optical flow of down-sampled frames are used in [9] to early determine the CU size. However, computation of optical flow itself within the encoding loop is considered computationally expensive and resource-intensive. Similar research is presented in [30], [31] that use motion data for early CU/PU decisions.
Following a different approach, Shen et al. [8] utilise previously encoded neighbouring and co-located CU information to skip the evaluation of unnecessary CU depth levels of the current CU. Such algorithms demonstrate significant degradation in the RD performance due to sub-optimal decisions and subsequent error propagation, especially in the case of video contents with complex textures and high motions [15], [32].
Similar methods that use statistical models generated from data within the encoding loop [8], [9], commonly rely on rigid and less flexible rule-based models, which cannot adapt to the dynamic changes of the video content. That being said, Mallikarachchi et al. [15] propose a content-adaptive CU size selection algorithm that utilises two CU classification models generated from the data collected online during the encoding phase. These, together with the moving window based feature selection process, ensure that the CU decisions obtained are relevant and adaptive to the content being encoded. However, the lack of data points for certain features within the limited window size makes certain CU split decisions unreliable, leading to coding losses.

2) LEARNING-BASED METHODS
On the other hand, learning-based methods utilise machine learning approaches to build models and to tune hyperparameters based on historical data collected during the video encoding process. SVM-based methods are commonly used in the literature for CU split decision prediction applications. For example, Grellert et al. [10] utilise features from the current CU depth level to construct SVM models to determine whether to early-stop the recursive CU split process. However, the method shows deficiencies for complex sequences encoded at low QPs where high CU depths are common. In this context, the evaluation of current CU depth level becomes ineffectual if the CU is decided to split further, which is often the case with complex sequences encoded at low QPs. The complexity reduction method proposed by Shen and Yu [12] makes use of Inter 2N × 2N RD cost and Skip/Merge data as features for the offline-trained SVM models. This method relies on the coding information of the current depth level. Hence, the time gains are hindered when the CU is decided to split, due to the redundant calculations at the current CU depth level.
The use of decision trees for CU size selection is considered in the algorithm proposed by Correa et al. [13]. In addition, Xu et al. [14] propose a Convolutional Neural Network (CNN) based approach for HEVC complexity reduction. However, the rigidness of the models in these approaches is seen as a common drawback which makes the CU decisions less adaptable for dynamics of the video content. Li et al. [33] also propose a complexity reduction algorithm using CNNs, yet, the anticipated time gains are relatively lower in this approach. CNNs are widely used in HEVC encoding to reduce the complexity of intra-prediction operations (e.g. [34], [35]). Nevertheless, these methods only rely on spatial information, which becomes less useful in the context of inter-prediction that operates in the temporal domain.

II. CU CLASSIFICATION FOR SPLIT LIKELIHOOD USING SVM
This section first introduces the CU split likelihood modelling followed by a description of iCUS, the SVM based CU classification model proposed in this paper.

A. CU SPLIT LIKELIHOOD MODELLING
In HEVC, the basic processing unit is a CTU, which has a maximum size of 64 × 64. During the encoding process, any frame in the sequence is initially divided into CTUs. Subsequently, each CTU is recursively sub-divided into four equal-sized blocks called CUs, that have sizes varying from 64 × 64 to 8 × 8, also identified as CU depth levels from 0 to 3, respectively. During the RD optimisation process, all possible PU modes for a given CU size are evaluated to determine the combination of partitioning structures and prediction modes that gives the best RD cost. Therefore, the final partitioning structure of a CTU is determined by the RD costs at each CU depth level. For a given CU depth level (say $i$), if the RD cost at depth $i$ is higher than that of the depth $i+1$, the CU is marked as split and it is marked as non-split otherwise. Therefore, the decision for sub-division at each CU depth level can be modelled as a binary classification problem, with classes $y \in \{+1, -1\}$ where $y = +1$ represents a split and $y = -1$ represents a non-split decision. This can be given by,
$$y = \begin{cases} +1, & \text{if } C_0 > \sum_{s=1}^{4} C_s \\ -1, & \text{otherwise} \end{cases} \tag{2}$$
where $C_0$ and $C_s$ ($s \in \{1, 2, 3, 4\}$) denote the RD cost of the current CU in the 2N × 2N mode and the RD cost for a sub CU, respectively.
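The labelling rule can be sketched as follows, assuming that the RD cost at depth i+1 is the sum of the costs of the four sub CUs. The dictionary-based node layout and the cost values are hypothetical and serve only to show how split/non-split labels emerge from a bottom-up comparison.

```python
def split_label(cost_unsplit, sub_cu_costs):
    """Return +1 (split) if the unsplit RD cost exceeds the summed cost of
    the four sub CUs, and -1 (non-split) otherwise."""
    if len(sub_cu_costs) != 4:
        raise ValueError("a quadtree split always produces four sub CUs")
    return +1 if cost_unsplit > sum(sub_cu_costs) else -1


def best_cost(node):
    """Recursively resolve a CU node into (best RD cost, label).

    node is a dict with the RD cost of coding the CU unsplit ("cost") and
    either four child nodes or None ("children"). This layout is an
    assumption made for the sketch.
    """
    children = node.get("children")
    if not children:
        # Smallest CU size reached: it cannot be split further.
        return node["cost"], -1
    child_costs = [best_cost(child)[0] for child in children]
    label = split_label(node["cost"], child_costs)
    return (sum(child_costs), +1) if label == +1 else (node["cost"], -1)


# Illustrative costs only.
leaf = {"cost": 10.0, "children": None}
expensive_parent = {"cost": 50.0, "children": [leaf, leaf, leaf, leaf]}
cheap_parent = {"cost": 30.0, "children": [leaf, leaf, leaf, leaf]}
```

Every (feature vector, label) pair produced this way during a full RD search can serve as one training sample for a split classifier at the corresponding depth.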
Binary classification problems can be modelled using various machine learning techniques such as Naïve Bayes [36], decision trees [37], neural networks [38], SVMs [39] and logistic regression [40]. However, the time taken by the model's inference process to predict the split decisions is very crucial to achieve higher encoding time complexity reductions. Hence, in this paper, the classification problem in (2) is modelled using SVMs due to their ability in handling binary classification problems with significant computational advantages [41].

1) SUPPORT VECTOR MACHINES (SVM)
SVMs are a supervised machine learning algorithm that is heavily used in classification, regression, and outlier detection problems. They first determine a separating hyperplane for the training data that maximises the separation among classes, known as the margin. Once the hyperplane is established, SVMs can predict the class label for a given feature vector from unseen data depending on the location of the new data item, relative to the hyperplane. For example, a training dataset $T_r$ of size $n$ where there are two classes with labels +1 and −1 can be represented as,
$$T_r = \left\{ (x_i, y_i) \mid x_i \in \mathbb{R}^{p},\ y_i \in \{+1, -1\} \right\}_{i=1}^{n} \tag{3}$$
Here, $x_i$ is a $p$-dimensional input vector, and $y_i$ is the corresponding class to which each $x_i$ belongs. During the training phase, SVM constructs a hyperplane that represents the largest separation of the dataset (known as the margin) in a higher-dimensional space, solving the primal optimisation function defined as,
$$\min_{w, b, \xi} \ \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} \phi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{4}$$
where $w$, $\xi$ and $C > 0$ are the normal vector to the hyperplane, slack variable and regularisation parameter, respectively, and $b$ is the offset of the hyperplane. Here, $\phi(x_i)$ is the mapping associated with the kernel function, which maps $x_i$ into a higher-dimensional space in the case of non-linear boundaries among the classes. Once the hyperplane is established, the classifications for unseen samples are obtained using the decision function,
$$f(x) = \operatorname{sgn}\left( w^{T} \phi(x) + b \right) \tag{5}$$
Fig. 3 graphically illustrates a sample dataset with two classes and the corresponding optimal hyperplane (in black) that achieves the maximum possible margins with the data points of the two classes.
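To make the inference step concrete, the sketch below evaluates a trained SVM in its kernel (dual) form, where the decision value is a weighted sum of kernel evaluations against the support vectors plus an offset. It assumes an RBF kernel and uses hand-picked support vectors, coefficients, and gamma purely for illustration; none of these values come from the trained models in this paper.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel-form decision value: sum_i coef_i * K(sv_i, x) + bias.

    Each coef_i is the product of the dual variable and the label of the
    i-th support vector, so the sign of the result gives the class.
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


def predict(x, support_vectors, dual_coefs, bias, gamma):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    value = decision_value(x, support_vectors, dual_coefs, bias, gamma)
    return +1 if value >= 0 else -1


# Toy model (hypothetical): one support vector per class.
toy_svs = [(0.0, 0.0), (2.0, 2.0)]
toy_coefs = [-1.0, +1.0]
near_positive = predict((1.8, 2.1), toy_svs, toy_coefs, bias=0.0, gamma=0.5)
near_negative = predict((0.1, 0.0), toy_svs, toy_coefs, bias=0.0, gamma=0.5)
```

The loop over `support_vectors` shows why prediction time grows with the number of support vectors: every prediction costs one kernel evaluation per support vector.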

2) FEATURE SELECTION
To use SVMs as the CU classification model, it is crucial to scrutinise the HEVC encoding process to identify the features that closely contribute to the CU split decision. Besides, it is important to identify and construct a dataset that is required for the training phase of the models. In this context, numerous features and parameters that can be extracted from the encoding loop per CU depth are considered, and their relative importance towards the CU split decision is analysed. The feature selection is carried out using a set of video sequences that is listed in Table 1. These sequences are carefully selected to represent a wide range of characteristics in video contents such as different texture complexities and motion details. Twenty frames from each sequence are encoded under fixed QP settings such that the QP ∈ {22, 27, 32, 37}, using the HM16.8 reference software [17]. Then, a set of features that encompass context, texture, and coding information of the current CU are extracted along with the CU split decision under RD optimisation to form the training dataset. Table 2 illustrates the set of features considered during the feature extraction step. The relative importance of subsets of these features to the CU split decision has been identified in the literature as well as in our previous work [10], [13], [15]. A brief overview of the features summarised in Table 2 is given below.

a: CONTEXT INFORMATION
The context information defines the features that can be extracted from the previously encoded neighbouring and co-located CTUs of the current CTU. For each neighbouring and co-located CTU, the maximum PU depth, average PU depth, number of bits consumed for the block, RD cost, and resulting distortion in the block are considered as the dominant features.

b: TEXTURE INFORMATION
Texture information for the CTU is extracted from the raw pixel values of the current encoding block. In this case, statistical information of the image block extracted by computing the Grey Level Co-occurrence Matrix (GLCM) [42] is used. GLCM is defined over an image as a matrix that represents the distribution of co-occurring pixel values. For example, for an $m \times n$ image with $p$ distinct pixel values, the co-occurrence matrix $C$ can be given as,
$$C_{\Delta x, \Delta y}(i, j) = \sum_{x=1}^{n} \sum_{y=1}^{m} \begin{cases} 1, & \text{if } I(x, y) = i \text{ and } I(x + \Delta x, y + \Delta y) = j \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
where $I$ is the image, $i$ and $j$ are pixel values, $x$ and $y$ are the spatial positions in the image $I$, and $(\Delta x, \Delta y)$ are the offsets in the horizontal and vertical directions in the image. Following the standard co-occurrence matrix definition [42], GLCM is created by calculating how often a certain pixel value $i$ occurs horizontally adjacent to a pixel with value $j$. Therefore, the offsets $\Delta x$ and $\Delta y$ are defined as 1 and 0, respectively. Once the co-occurrence matrix is determined, the texture features representing the homogeneity, contrast, and local range data are computed for the current image segment that is being encoded. In this case, the contrast ($\Upsilon$) and the homogeneity can be defined by,
$$\Upsilon = \sum_{i, j} |i - j|^{2} \, C(i, j) \tag{7}$$
and
$$\text{homogeneity} = \sum_{i, j} \frac{C(i, j)}{1 + |i - j|} \tag{8}$$
respectively, where $C$ denotes the co-occurrence matrix while $i$ and $j$ are pixel values for which the GLCM is defined.
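The texture features can be illustrated with a small, dependency-free sketch. It assumes a horizontal offset of one pixel, a normalised co-occurrence matrix (entries sum to 1), contrast as the sum of |i − j|² weighted by the matrix entries, and homogeneity as the sum of the entries divided by 1 + |i − j|. Whether the implementation evaluated in this paper normalises the matrix is not stated, so the normalisation step and the tiny three-level image are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count how often value i at (x, y) co-occurs with value j at
    (x + dx, y + dy). image is a list of rows of integer grey levels."""
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for y in range(rows):
        for x in range(cols):
            nx, ny = x + dx, y + dy
            if 0 <= nx < cols and 0 <= ny < rows:
                counts[image[y][x]][image[ny][nx]] += 1
    return counts


def normalise(counts):
    """Scale the matrix so that its entries sum to 1 (assumed step)."""
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def contrast(matrix):
    """Sum of |i - j|^2 weighted by the co-occurrence entries."""
    return sum(
        (i - j) ** 2 * value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    )


def homogeneity(matrix):
    """Sum of the co-occurrence entries divided by 1 + |i - j|."""
    return sum(
        value / (1 + abs(i - j))
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    )


# Tiny illustrative block with three grey levels.
block = [
    [0, 0, 1],
    [1, 2, 2],
]
block_counts = glcm(block, levels=3)
block_matrix = normalise(block_counts)
```

For real 8-bit blocks the grey levels are often quantised to a small number of bins before building the matrix, which keeps the matrix small and cheap to compute inside the encoding loop.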

c: CURRENT CU INFORMATION
This category corresponds to the information that is available for the block that is currently being encoded. This includes the QP, RD cost, and distortion of the 2N × 2N PU mode. In this case, the proposed model is applied to the CU once the initial 2N × 2N PU mode is evaluated for the current CU. Hence, the resulting RD cost and distortion for the 2N × 2N PU mode are available for the inference process of the proposed algorithm (Fig. 5 in Sec. IV).
The features identified in Table 2 carry different weight contributions towards the CU split decision. Therefore, the F-score for each feature at each CU depth level is calculated, in order to identify the features that contribute the most towards the split decision of the given CU [43]. In this case, for the binary classification problem with classes +1 (positive) and −1 (negative), the F-score for a given feature $i$ is calculated as defined in [43]. The F-score values computed for the features in each depth level are presented in Tables 3-5. A higher F-score value of a feature indicates that its contribution towards the CU split decision is significant. Hence, these F-score figures are used to limit the number of features actually selected per CU depth level when generating the classification models. This ensures that the computational complexity of the inference process is kept low while retaining the accuracy of the prediction models. In this regard, the number of features used at each CU depth level of the proposed algorithm is empirically decided to be four, which has given a balanced trade-off between model accuracy and computational complexity. In addition, the feature selection ensures that only one of the average or maximum depth of neighbouring CTUs is included in the final feature list, so that multiple aspects of the CU are taken into consideration for the models. The final sets of selected features for each CU depth level are depicted in Table 6.
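As a rough illustration of F-score based feature ranking, the sketch below uses one widely used definition for binary problems: the squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances. It is an assumption that this is the exact variant cited as [43]; the feature names and sample values are hypothetical.

```python
def f_score(positive_values, negative_values):
    """Discrimination score of one feature for a two-class problem."""
    n_pos, n_neg = len(positive_values), len(negative_values)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two samples")
    mean_pos = sum(positive_values) / n_pos
    mean_neg = sum(negative_values) / n_neg
    mean_all = (sum(positive_values) + sum(negative_values)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (
        sum((v - mean_pos) ** 2 for v in positive_values) / (n_pos - 1)
        + sum((v - mean_neg) ** 2 for v in negative_values) / (n_neg - 1)
    )
    if within == 0:
        # Degenerate case: no spread inside either class.
        return float("inf") if between > 0 else 0.0
    return between / within


def top_features(scores, k=4):
    """Return the names of the k features with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical per-feature samples: (values for split CUs, values for non-split CUs).
samples = {
    "rd_cost_2Nx2N": ([4.0, 5.0, 6.0], [1.0, 2.0, 3.0]),
    "qp": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
}
scores = {name: f_score(pos, neg) for name, (pos, neg) in samples.items()}
ranking = top_features(scores, k=1)
```

A feature whose values are distributed identically in both classes scores zero, while a feature whose class means are far apart relative to the within-class spread scores high. Keeping only the top four features per depth, as done in this work, then reduces to a sort over these scores.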

B. TRAINING OFFLINE MODELS
Once the features are finalised and the dataset is prepared, the models are trained offline. In this case, several hyper-parameters specific to the SVMs are determined as follows. First, the kernel function for the SVMs is determined as Radial Basis Function (RBF) as it demonstrates capabilities to handle SVM decision boundaries that are non-linear and is shown to efficiently work with small numbers of features [45], both of which are crucial for the proposed algorithm. Next, hyper-parameters associated with the SVMs are selected by following a grid search across a range of possible values as proposed in [46]. In this case, the hyper-parameters of the SVMs include cost ($C$), gamma ($\gamma$), and the number of training samples ($N$). Table 7 shows the range of values evaluated in the grid search to estimate the optimal values for the respective parameters. For parameters $C$ and $\gamma$, a step size of 2 between the upper and lower bounds is used. The combination of hyper-parameters that provides the highest accuracy for the SVM models is used for model training. In this case, the accuracy ($\zeta$) for each combination of hyper-parameters is evaluated using,
$$\zeta = \frac{T_{+} + T_{-}}{N_{tot}} \tag{10}$$
where $T_{+}$, $T_{-}$, and $N_{tot}$ denote the numbers of true positive samples, true negative samples, and total predictions, respectively. Training and cross-validation datasets for hyper-parameter tuning are formed by splitting the data gathered during the feature selection process. In this case, datasets are divided to ensure that there are equal numbers of samples from both split and non-split classes in the training sets, because SVMs perform poorly for unbalanced data [47]. The number of training samples used in this hyper-parameter tuning stage varies from 400 up to 1200, with a step size of 100.
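The accuracy measure and the selection of the best hyper-parameter combination can be sketched as below. Only the training-set sizes (400 to 1200 in steps of 100) come from the text; the `results` entries, including their C and gamma values and accuracies, are made-up placeholders standing in for the outcome of actual training runs.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of correct predictions: (true positives + true negatives)
    divided by the total number of predictions."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def best_combination(results):
    """Pick the hyper-parameter combination with the highest accuracy.

    results maps a (C, gamma, N) tuple to the accuracy measured on the
    cross-validation set for a model trained with those settings.
    """
    return max(results, key=results.get)


# Training-set sizes evaluated during tuning, as stated in the text.
training_sizes = list(range(400, 1300, 100))

# Placeholder results for illustration only.
results = {
    (1.0, 0.5, 400): 0.81,
    (4.0, 0.5, 800): 0.86,
    (4.0, 2.0, 1200): 0.84,
}
chosen = best_combination(results)
example_accuracy = accuracy([+1, -1, +1, -1], [+1, -1, -1, -1])
```

Because the cross-validation set is balanced, plain accuracy is a reasonable selection criterion here; with unbalanced classes a per-class measure would be more informative.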
In general, higher numbers of training samples result in higher accuracy, yet at the same time increase the number of support vectors. This leads to an increase in the prediction time, which ultimately affects the encoding time gains that can be achieved. Therefore, the maximum number of training samples considered for this grid search is empirically determined to be limited to 1200 in this work. The upper bound of 1200 ensures that the maximum possible number of support vectors is limited to 1200. A cross-validation set is defined with the size of 600, that encompasses 300 samples for each split and non-split classes.
The hyper-parameter value combination that results in the highest accuracy is selected as the optimal combination. This search is conducted for all depth levels, i.e., depth 0 to 2, and Table 8 shows the selected hyper-parameter values for each depth.

C. COMPUTING THE CU SPLIT DECISION
The posterior probabilities of the split and non-split classes, calculated during the SVM prediction, are checked to predict the class for the new instances. Then, the CU split decision from the SVM prediction models ($O_{SVM}$) is determined as,
$$O_{SVM} = \begin{cases} \text{split}, & \text{if } p_{+} > T_{svm} \\ \text{decide by RD optimisation}, & \text{otherwise} \end{cases} \tag{11}$$
where $p_{+}$ and $T_{svm}$ denote the posterior probability determined by the SVM model for the split class and the SVM decision threshold, respectively. By default, the probability threshold ($T_{svm}$) is set to 0.5. Increasing $T_{svm}$, however, increases the confidence of the split decision taken in (11). For example, if the decision to split a CU is taken with higher confidence (i.e., higher posterior probability for class split), the decision is more likely to be accurate. In this case, the proposed algorithm offers the flexibility to change $T_{svm}$ to trade off the encoding complexity reduction against the coding efficiency. If the probability $p_{+}$ does not exceed the threshold $T_{svm}$, the CU split decision is handed over to the traditional RD optimisation. Hence, $T_{svm}$ acts as a design parameter that controls the number of CU split decisions taken by the SVM models and RD optimisation. Increasing $T_{svm}$ leads to increases in the number of CU split decisions taken by RD optimisation, resulting in smaller encoding time reductions. Yet, at the same time, this ensures that the impact on the coding efficiency is minimal as the less confident SVM decisions are discarded in the process.
[Fig. 4: relationship between the features in Table 6 and the CU split decision for each depth $d \in \{0, 1, 2\}$. The histograms depict the number of occurrences of each feature for split and non-split CUs for QP=24.]
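The role of the SVM threshold can be sketched as a simple gate. The sketch assumes that the SVM commits to a split only when the split-class posterior exceeds the threshold and otherwise defers to RD optimisation; the string return values and the sample probabilities are hypothetical.

```python
def svm_split_decision(p_split, t_svm=0.5):
    """Return "split" when the SVM is confident enough, else "rdo" to
    signal that the decision is deferred to RD optimisation."""
    if not 0.0 <= p_split <= 1.0:
        raise ValueError("p_split must be a probability")
    return "split" if p_split > t_svm else "rdo"


def svm_decision_share(probabilities, t_svm):
    """Fraction of CUs whose split decision is taken by the SVM."""
    decided = sum(1 for p in probabilities if svm_split_decision(p, t_svm) == "split")
    return decided / len(probabilities)


# Hypothetical posterior probabilities for four CUs.
posteriors = [0.20, 0.55, 0.70, 0.95]
share_default = svm_decision_share(posteriors, t_svm=0.5)
share_strict = svm_decision_share(posteriors, t_svm=0.9)
```

Raising the threshold from 0.5 to 0.9 in this toy example reduces the share of SVM-decided CUs from three out of four to one out of four, which mirrors the trade-off described above: fewer shortcuts and less time saving, but fewer low-confidence decisions that could harm coding efficiency.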

III. CU SPLIT DECISION CLASSIFICATION USING PROBABILISTIC MODELS
The SVM models discussed in Sec. II are generic models generated by performing an offline training process on the data collected offline. However, it has been shown that offline trained generic models may not perform well with the dynamics of the video content. Thus, content specific CU split likelihood models are important to achieve a higher prediction accuracy [48]. Therefore, this section describes an online prediction model (that acts alongside the offline SVM prediction models) that can keep the CU split decision prediction content-adaptive.
The CU split likelihood can be modelled as a Bayesian probabilistic model [15], [36]. In this case, the posterior probability of whether a CU is split at a particular depth level $d$ is given by,
$$P(S_d \mid X_d) = \frac{P(X_d \mid S_d) \, P(S_d)}{P(X_d)} \tag{12}$$
where $P(S_d)$ is the prior probability of a CU being split at depth $d$, $P(X_d)$ is the marginal likelihood of observing a feature vector $X_d$ in the feature space at depth $d$, and $P(X_d \mid S_d)$ is the likelihood of a feature vector $X_d$, given the CU is split at depth $d$.
In this case, the features that have been identified in Table 6 are taken as inputs to $X_d$. The relationship among the features in $X_d$ and CU split decision is complex and content-dependent as depicted in the histograms in Fig. 4. The graphs show the impact of a feature vector $X_d$ towards the CU split decision for a given QP.
The data for the probabilistic model given in (12) are accumulated during the encoding process. The initial frames of the sequence are encoded using traditional RD optimisation until $N_p$ data points are collected for each feature vector at each CU depth level. Once the data points are accumulated, the decision to split or not to split a CU is obtained by comparing (12) with an empirically determined threshold $T_b$. Hence, the CU split decision for a CU at depth $d$, denoted here by $O_{B}$ by analogy with $O_{SVM}$, is given by

$$O_{B} = \begin{cases} \text{split}, & P(S_d \mid X_d) > T_b \\ \text{non-split}, & \text{otherwise} \end{cases} \tag{13}$$

The threshold $T_b$ acts as a switch that decides whether or not to split a CU. Empirical observations reveal that the value of $T_b$ impacts both the bit rate and the quality. A large $T_b$ results in fewer CUs being split, whereas a smaller $T_b$ generally forces more CUs to split. $T_b$ is therefore an important parameter that can trade off coding efficiency against encoding complexity, and it is maintained as a design parameter that can be preset to achieve the desired trade-off between the quality and the bit rate.
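The online data accumulation and thresholding described above can be sketched as follows. This is a minimal illustration under stated assumptions: feature vectors are discrete and hashable (tuples), the posterior is estimated from empirical counts, and only outcomes produced by the RD optimisation are recorded. The class and method names are hypothetical and do not come from the authors' implementation.

```python
from collections import defaultdict


class OnlineSplitModel:
    """Online CU split likelihood model for one depth level (illustrative sketch)."""

    def __init__(self, n_p: int, t_b: float):
        if n_p < 1:
            raise ValueError("n_p must be at least 1")
        self.n_p = n_p                  # data points required per feature vector
        self.t_b = t_b                  # split decision threshold
        self.split = defaultdict(int)   # occurrences of (x, split)
        self.total = defaultdict(int)   # occurrences of x

    def observe(self, x: tuple, was_split: bool) -> None:
        """Record the RD optimisation outcome for feature vector x."""
        self.total[x] += 1
        if was_split:
            self.split[x] += 1

    def ready(self, x: tuple) -> bool:
        """True once enough data points exist for feature vector x."""
        return self.total[x] >= self.n_p

    def posterior(self, x: tuple) -> float:
        """Empirical P(S | x) = P(x | S) P(S) / P(x) = count(x, S) / count(x)."""
        return self.split[x] / self.total[x]

    def decide(self, x: tuple) -> str:
        """Defer to RD optimisation until ready, then threshold the posterior."""
        if not self.ready(x):
            return "rd_optimisation"
        return "split" if self.posterior(x) > self.t_b else "non_split"
```

With empirical estimates, the three terms of Bayes' rule reduce to a ratio of counts, so the model only needs two counters per feature vector.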
The same set of features is used for both the offline-trained SVMs and the online-built probabilistic models. For the data depicted in the graphs of Fig. 4, the QP has been fixed at 24.

IV. PROPOSED METHOD
This section describes the overall CU size selection algorithm that makes use of the two CU split likelihood modelling methods described in Sec. II and Sec. III.
The SVM and Bayesian prediction models discussed in Sec. II and Sec. III, respectively, demonstrate different characteristics in terms of encoding time reduction and coding efficiency. For example, offline-trained models (such as the SVM-based model described in Sec. II) generally achieve higher computational complexity reductions than online-trained models (such as the probabilistic model described in Sec. III), because the time spent during the encoding process to collect sequence-specific data hinders the encoding time reduction achieved by the latter. On the other hand, online-trained probabilistic models demonstrate lower Bjøntegaard Delta Bit Rate (BDBR) [49] losses, since their CU split decisions are in general content-adaptive and relevant to the video content being encoded [48]. In this context, an experimental sweep is conducted using a set of training sequences to evaluate the impact of each CU split decision estimation model at each CU depth level. Table 9 depicts the encoding time performance achieved against the BDBR loss at each CU depth level for each prediction model. The experiment is carried out by enabling only the SVM or the probabilistic model at a particular depth level, while the RD optimisation makes the split decision at the remaining depth levels. The set of video sequences used for this experiment (BasketballPass, Cactus, Kimono, PartyScene, and RaceHorses) is chosen to be representative of a wide diversity of video characteristics. In this experiment, $T_{svm}$ varies from 0.1 to 0.9 and $T_b$ varies from 0.5 to 0.9, with a step size of 0.1 for both parameters.
The experimental results in Table 9 show that, at depth 0, the SVM prediction model yields a slightly higher encoding time reduction than the Bayesian probabilistic model, with a much lower BDBR loss. Similarly, at depth 1, the SVM prediction model achieves a higher time complexity reduction than the probabilistic model at only a small difference in BDBR loss. Based on this, the proposed CU size selection algorithm uses the offline-trained SVM prediction models at CU depth levels 0 and 1 to determine whether to split the current CU. At CU depth level 2, however, Table 9 shows that the SVM model incurs a higher BDBR loss than the probabilistic model for a similar encoding time reduction. Hence, the Bayesian probabilistic model is adopted at depth 2 to make the CU split/non-split decision in the proposed fast encoding algorithm.
In addition to the proposed offline-trained SVM and online-trained Bayesian prediction models, the status of the SKIP mode is also taken into consideration for the CU size selection. When the CU split decision is left to the RD optimisation, the encoder evaluates the PU modes of both the current and the next depth levels. Once the calculations for the PU modes of the current depth level are complete, the best PU mode for the current CU size (i.e., depth level) is selected by the RD optimisation. If the SKIP mode is selected as the best PU mode for the current depth level, further splitting of the CU is terminated. The SKIP mode usually corresponds to a static image region where there is no residual [50] (i.e., the image block can be perfectly predicted from a previous image). Hence, further splitting the CU into sub-CUs has been shown to be ineffectual [25]. A similar approach is adopted in the proposed algorithm to avoid redundant sub-CU level PU mode evaluations, thereby further reducing the encoding time. Fig. 5 illustrates the overall architecture of the proposed encoding algorithm, which uses both the SVM and the probabilistic CU split likelihood models to determine the CU split decisions.
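The per-depth model assignment and the SKIP-based early termination described in this section can be sketched as follows. This is a simplified illustration and not a reproduction of Fig. 5: the function names, the string labels, the use of a single threshold `tau` for both models, and the `None` convention for a model that cannot yet predict are all assumptions made for the example.

```python
from typing import Optional

SVM_DEPTHS = (0, 1)   # offline-trained SVM models are used at these depths
BAYES_DEPTH = 2       # online-trained Bayesian model is used at this depth


def model_decision(depth: int, p_split: Optional[float], tau: float) -> str:
    """Decision of the prediction model assigned to the given CU depth.

    p_split is the split posterior from the model used at this depth, or None
    when no prediction is available (e.g. the online model is still collecting data).
    """
    if depth not in SVM_DEPTHS and depth != BAYES_DEPTH:
        raise ValueError("split decisions are modelled for depths 0, 1 and 2 only")
    if p_split is None:
        return "rd_optimisation"
    if p_split > tau:
        return "split"
    # A low-confidence SVM prediction is deferred to RD optimisation, whereas
    # the Bayesian threshold acts as a split / non-split switch.
    return "rd_optimisation" if depth in SVM_DEPTHS else "non_split"


def evaluate_sub_cus(depth: int, p_split: Optional[float], tau: float,
                     best_mode_at_depth: Optional[str] = None) -> bool:
    """Return True when the encoder should go on to evaluate the sub-CUs."""
    decision = model_decision(depth, p_split, tau)
    if decision == "split":
        return True
    if decision == "non_split":
        return False
    # RD optimisation path: the current-depth PU modes are evaluated first, and
    # splitting is terminated when SKIP is the best mode at this depth.
    return best_mode_at_depth != "SKIP"
```

Keeping the model decision separate from the SKIP check makes it straightforward to swap the model used at a given depth, as done in the model-combination experiments.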

V. EXPERIMENTAL RESULTS AND DISCUSSION
This section presents the experimental results of the proposed fast CU size selection algorithm for HEVC inter-prediction. The proposed algorithm is implemented in the HM16.8 reference software [17], and the optimised SVM library libSVM [45] is employed to implement the SVM prediction models described in Sec. II. The impacts of the proposed algorithm on the coding efficiency and the encoding time gains are compared with several state-of-the-art algorithms: HM16.8 [17], the SVM-based CU size selection algorithm proposed by Grellert et al. [10], the content-adaptive fast CU size selection algorithms proposed by Mallikarachchi et al. [15] and Lee et al. [24], and the fast PU size selection algorithm proposed by Vanne et al. [25].

A. EXPERIMENTAL SETUP AND ENCODING CONFIGURATIONS
The proposed algorithm is evaluated for a range of HD and UHD video sequences whose content ranges from simple to highly complex motion with diverse spatial and temporal characteristics. The list of sequences used in the experiments and their abbreviations (as reported in the subsequent tables) are presented in Table 10. In this experiment, the video sequences, encoding configurations, and QP values are selected as defined in the HEVC common test configurations [51]. Specifically, the Low Delay B, Low Delay P, and Random Access configurations are used for the encoding with QP ∈ {22, 27, 32, 37}. All experiments are carried out on a system with a 64-core AMD CPU fixed at 2.5 GHz and 64 GB of RAM, running the Ubuntu 14.04 64-bit operating system. The number of processes (i.e., encoding instances) executed in parallel at any given time is set to four to ensure that the system is not overloaded.
The impact on the RD performance (i.e., coding efficiency) is evaluated using the BDBR metric [49], and the average percentage encoding time saving, $\Delta T$, is evaluated for the proposed and state-of-the-art algorithms by

$$\Delta T = \frac{T_{HM} - T_{\rho}}{T_{HM}} \times 100\%$$

where $T_{HM}$ and $T_{\rho}$ denote the encoding times of the HM reference software and each of the fast encoding approaches, respectively.
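The time saving metric described above can be computed as in the following sketch. It assumes that the average is an arithmetic mean of the per-run percentages (for example, over the QP values); the function names are illustrative.

```python
def time_saving(t_hm: float, t_fast: float) -> float:
    """Percentage encoding time saving of a fast encoder relative to HM."""
    if t_hm <= 0:
        raise ValueError("t_hm must be positive")
    return (t_hm - t_fast) / t_hm * 100.0


def average_time_saving(runs) -> float:
    """Mean saving over (t_hm, t_fast) pairs, e.g. one pair per QP value."""
    savings = [time_saving(t_hm, t_fast) for t_hm, t_fast in runs]
    return sum(savings) / len(savings)
```

Both times must be measured under the same conditions (same sequence, configuration, QP and machine load) for the ratio to be meaningful.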

B. RESULTS AND PERFORMANCE ANALYSIS
This section first discusses the impact of the CU split decision thresholds utilised in the SVM and probabilistic prediction models described in Sec. II and Sec. III, respectively. This is followed by an analysis of the overall performance and the implications of using the proposed fast coding framework.

1) IMPACT OF THE MODEL THRESHOLDS
The ideal scenario for a complexity reduction algorithm is to obtain the maximum time gain with the minimum coding loss. However, as observed in Fig. 6, the time gain and the BDBR loss have a direct relationship: the BDBR loss increases with increasing time gain, and vice versa. These graphs correspond to the average encoding time gain and BDBR loss obtained from a set of four video sequences (BasketballDrive, BasketballPass, RaceHorses, and Vidyo), with the prediction models for each CU depth level utilised as described in Sec. IV (i.e., the SVM model is active at depth levels 0 and 1 and the probabilistic model is active at depth level 2). In this case, an experimental sweep is conducted in which $T_{svm}$ and $T_b$ are varied from 0.1 to 0.9 with a step size of 0.1. The resulting encoding time reduction ($\Delta T$ %) and BDBR loss are illustrated in Fig. 6. These thresholds are defined as design parameters that allow the algorithm to trade off the encoding time gain against the coding efficiency, depending on the requirement. For the results presented in the subsequent sections, the two thresholds (both represented by τ) are set to $T_{svm} = T_b = \tau$, with τ = 0.7 and τ = 0.6.
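One way to use such a sweep is to pick the threshold that gives the largest time gain within a tolerated BDBR loss. The sketch below illustrates this selection; the `SweepPoint` structure, the function names and the selection rule are assumptions made for the example, and the measurements themselves would have to come from actual encoder runs.

```python
from typing import Iterable, List, NamedTuple, Optional


class SweepPoint(NamedTuple):
    tau: float         # threshold value used for the run
    time_gain: float   # average encoding time reduction, in percent
    bdbr_loss: float   # BDBR loss, in percent


def sweep_thresholds() -> List[float]:
    """Threshold grid from 0.1 to 0.9 with a step size of 0.1."""
    return [k / 10 for k in range(1, 10)]


def pick_operating_point(points: Iterable[SweepPoint],
                         max_bdbr_loss: float) -> Optional[SweepPoint]:
    """Largest time gain among the points within the BDBR loss budget."""
    feasible = [p for p in points if p.bdbr_loss <= max_bdbr_loss]
    return max(feasible, key=lambda p: p.time_gain) if feasible else None
```

Returning `None` when no point satisfies the budget leaves the fallback (for example, disabling the fast decisions) to the caller.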

2) OVERALL PERFORMANCE OF THE PROPOSED ALGORITHM
The performance of the proposed algorithm is presented in Tables 11, 12, and 13 for the Low Delay B, Random Access, and Low Delay P configurations, respectively. The impact of the video content, QPs, and other relevant attributes on the performance is analysed in the following discussion with additional tables.

a: PERFORMANCE VARIATION WITH QP
For the Random Access configuration, it can be observed that the proposed algorithm (iCUS) achieves significant and consistent encoding time gains for all QP values. However, the achievable time gain increases with increasing QP. This can be attributed to the features selected in the SVM and probabilistic models. A category of features used in the proposed algorithm (see Table 2) corresponds to the information obtained from neighbouring CTUs, which indicates that the CU split decisions are influenced by the split decisions of the neighbouring blocks. When the QP value is low, a picture is partitioned into more CUs, reducing the correlation among the neighbouring blocks. This behaviour is comparable to that of [25], which reports increasing time gains with increasing QP values. However, other state-of-the-art methods demonstrate considerably weaker performance at lower QP values.
In addition, the SKIP mode is rarely selected as the best PU mode for a CU depth level when lower QPs are used. Consequently, large CU depth levels (i.e., smaller CU sizes such as 16×16 and 8×8) are frequently utilised within a frame when the QP is low. Hence, the achievable encoding time gain can be hindered for algorithms that utilise the SKIP mode as a decision-making feature (e.g., [24]) when they operate with lower QPs.
Furthermore, a similar observation can be made for highly textured and complex sequences such as PeopleOnStreet (S1) when lower QPs are used for encoding. The complex textures and high motion result in a lower number of SKIP mode selections. Hence, the encoded bitstreams contain a large number of smaller CUs per frame, resulting in more decisions being taken by the RD optimisation rather than by the proposed prediction models.
Following the same reasoning, the FourPeople (S9) video sequence has relatively low motion, with a large portion of each frame occupied by a homogeneous background. A higher number of predictions can therefore be made with the SKIP mode, so the method in [24] performs relatively better even at lower QPs. However, the proposed method also achieves comparable results for all QPs in this sequence.

b: PERFORMANCE VARIATION WITH THE CONTENT
The observations in Tables 11, 12, and 13 show that the proposed method achieves higher encoding time complexity reductions for sequences that have very limited motion (e.g., Johnny, FourPeople, and KristenAndSara). In contrast, the sequences that have higher motion (e.g., RaceHorses, PartyScene, and BasketballDrill) achieve comparatively lower time gains. High motion implies more foreground motion (i.e., moving objects), leading to less correlation among neighbouring blocks and fewer selections of the SKIP mode. Since iCUS decisions are influenced by both the neighbouring information and the SKIP mode, more CUs are sent to the RD optimisation, which reduces the time complexity reduction that can be achieved.
Furthermore, the observed results show that for sequences with relatively higher motion (S1-S3), the proposed method achieves higher time gains than the state-of-the-art methods. In these sequences, there is a high correlation with the colocated blocks in the reference frames. Incorporating this information as features in the models (Table 6) enables the proposed method to achieve higher time gains. Due to the high texture granularity and relatively higher motion, the methods that use SKIP as the main feature (e.g., [24]) fail to achieve higher time gains. However, their reliance on RD optimisation in such cases leads to a lower impact on the BDBR. On the contrary, the proposed method has a relatively higher impact on the BDBR when the threshold is increased. However, at lower values of the threshold, the proposed method manages to outperform the state-of-the-art methods with higher time gains and relatively lower BDBR impacts.

c: PERFORMANCE VARIATION WITH THE ENCODING CONFIGURATION
From the results presented in Table 11, it is observed that iCUS with τ = 0.6 achieves a significant time complexity reduction of 53.46% with a BDBR loss of 2.35%. This outperforms [10] and [15], which achieve lower time complexity reductions. Although [24] reports a slightly higher time complexity reduction, this comes at a higher BDBR loss of 3.14%. The algorithm of Vanne et al. [25] records a time complexity reduction similar to that of iCUS with τ = 0.7. However, their algorithm lacks the flexibility to trade off the encoding time reduction against the coding efficiency, which is a crucial improvement of the proposed algorithm over [25] and [24]. Table 12 depicts the results for the Random Access configuration. iCUS with τ = 0.7 outperforms [10], [15], and [25] in terms of both time gains and BDBR losses. When τ = 0.6, iCUS achieves a very high complexity reduction of 61.15% at a BDBR loss of 2.9%, outperforming [24] in terms of both the encoding time complexity reduction and the BDBR loss. A similar observation can be made for the Low Delay P configuration results presented in Table 13, where the proposed algorithm with τ = 0.7 outperforms all state-of-the-art methods.
RD curves for two sample sequences, Johnny and PartyScene, are given for all configurations in Fig. 7. It can be observed that the RD performance of the proposed iCUS algorithm is similar to that of HM16.8; thus, the BDBR loss of the proposed method can be considered negligible.

d: IMPACT OF MODEL SELECTION ON THE ENCODING PERFORMANCE
As indicated in Sec. IV, the proposed algorithm utilises SVM prediction models for CU depths 0 and 1, whereas CU depth 2 uses a Bayesian probabilistic model. The average experimental results presented in Table 16 summarise the encoding time and BDBR loss of the proposed algorithm when using different combinations of prediction models at each CU depth level. The configurations defined in Table 16 correspond to A: using SVMs for all CU depths, B: using Bayesian models for all CU depths, C: using SVMs for CU depths 0 and 1 and RD optimisation for CU depth 2, and D: the proposed iCUS configuration. These experiments are conducted on test video sequences that are not part of the initial training set given in Table 1.
It can be observed that configuration D, which uses SVM models at depths 0 and 1 and the probabilistic model at depth 2, gives the highest encoding time gain while keeping the BDBR loss at an average of 0.62%. The impact of the rigidness of the SVM models is evident in the performance of configuration A, in which all CU depth levels use the SVM models. The considerably low time gain of B suggests that building the Bayesian probabilistic models for all depth levels during encoding is inefficient from the perspective of time complexity. This is because the encoder keeps making RD optimisation based decisions until a sufficient number of data points has been accumulated to use the Bayesian model given in (13). Configuration C, on the other hand, also provides a considerable time gain with a marginal BDBR loss. It can be seen that adding the Bayesian model for the decisions at CU depth level 2 increases the BDBR loss by 0.09%, whereas the encoding time reduction improves by a margin of ≈7%.

e: PERFORMANCE COMPARISON WITH THE HM16.8 BUILT-IN FAST ENCODING METHODS
The performance of the proposed algorithm is also compared against the fast encoding techniques that are embedded in the HM16.8 reference encoder implementation, namely ECU [27], ESD [28], and CFM [29]. Their encoding time performance, together with the coding efficiency impact, is presented in Table 14. The experimental results in Table 14 correspond to the Low Delay B main configuration, which gives the minimum BDBR increase for the proposed iCUS algorithm (τ = 0.7). It can be observed that the proposed algorithm outperforms each of the individual built-in fast methods in terms of the encoding time reduction. Further, when all three built-in methods are activated simultaneously, the encoding time reduction achieved is similar to that of the proposed iCUS algorithm. However, the coding efficiency loss is much more noticeable in the former than in the latter. Furthermore, the ability of the proposed algorithm to trade off the encoding time reduction against the coding efficiency loss is an extra benefit, as opposed to the fixed time gains achieved when using the built-in fast encoding methods.
Finally, it should be noted that the computational costs associated with the SVM inference, data collection for the probabilistic models, and the decision-prediction are all included in performance calculations reported. Therefore, it is evident that the additional complexity overheads introduced by the proposed algorithm are negligible when compared with the significant time savings that can be achieved by incorporating these algorithms into the encoding cycle.

VI. CONCLUSION
This paper proposes an intelligent fast CU selection algorithm for HEVC inter-prediction. In this context, two split likelihood models (an offline-trained SVM model and an online-trained Bayesian probabilistic model) are introduced to predict the CU split and non-split decisions. The hybrid use of offline and online trained models for the CU size selection keeps the split/non-split decisions made during the encoding content-adaptive and accurate. Moreover, the flexibility of the algorithm to effectively trade-off the computational complexity to the coding efficiency is also investigated.
One of the main conclusions of this work is that the use of online models for CU size selection at specific CU depths keeps the algorithm content-adaptive. The data collection during the encoding phase ensures that the decisions made by the algorithm are relevant to the content, and hence avoids increases in the BDBR losses. Experimental results reveal that the proposed algorithm achieves significant encoding time gains across all QP values, with slightly higher performance at higher QP values. As a result, it can be concluded that the proposed algorithm can be used to reduce the encoding time complexity in applications that need a diverse range of video quality levels. Furthermore, configuring the threshold levels for the SVM model and for the conditional probability of a given feature vector allows the algorithm to control the percentage of CUs that are evaluated using traditional RD optimisation. This enables the algorithm to trade off the coding complexity against the coding efficiency.
In conclusion, the simulation results demonstrate average encoding time reductions of 53.46%, 61.15%, and 58.15% for the Low Delay B, Random Access, and Low Delay P configurations, respectively, with negligible impacts on the coding efficiency, across a wide range of content types and quality levels. Future work will focus on extending the algorithm to address the partitioning of PUs and TUs. Furthermore, other lightweight neural networks will also be analysed to train models that predict the entire CU, PU, and TU structure for a given block, in order to achieve higher encoding time complexity reductions while keeping the coding efficiency intact.