Efficient Partition Decision Based on Visual Perception and Machine Learning for H.266/Versatile Video Coding

H.266/Versatile Video Coding (VVC) is the latest international video coding standard, designed to encode ultra-high-definition video efficiently. Its quadtree with nested multi-type tree (QT-MTT) structure provides coding tree partitions of various sizes and allows nested binary tree (BT) and ternary tree (TT) splits at each QT level. The H.266/VVC encoder is also equipped with numerous advanced coding tools. As a result, the encoding time increases tremendously. Previous research on fast coding algorithms for H.266/VVC seldom considers perceptual redundancy. This paper uses the human vision model of just noticeable difference to extract the visually distinguishable pixels that may affect visual perception. We observe that the distributions obtained from the horizontal and vertical projections of the visually distinguishable pixels within a coding unit are related to the corresponding MTT splitting modes. These distributions, which represent the perceptual information of human vision, are therefore used as the input features of machine learning. A fast MTT decision based on random forest models is proposed to quickly select the partition for intra coding. Experimental results demonstrate that the proposed method effectively accelerates the intra coding process while maintaining good bitrate and video quality based on the properties of visual perception. The proposed algorithm provides better performance than the previous work.


I. INTRODUCTION
Video coding has been one of the most important and active research topics over the past couple of decades and is closely connected to industrial applications in both hardware and software. The wide range of applications includes mobile phones, videophones, videoconferencing, video streaming, digital television, distance learning, and remote medical systems. Videophones and videoconferencing are becoming a global trend for keeping people in contact with each other, and they also serve as tools for providing emotional support and resources.
The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed. Efficient video coding technologies are quickly adopted and provide many valuable services. The explosive demand for large amounts of ultra-high-definition (UHD) video brings new challenges to the capability of video compression. The Joint Video Experts Team (JVET), formed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), established the latest video coding standard, H.266/versatile video coding (VVC) [1]-[3], whose coding performance exceeds that of H.265/high efficiency video coding (HEVC) [4]-[10]. It is intended to meet the requirements of various industrial applications, including UHD television for home theater, high dynamic range video, 360-degree video, live streaming, HD video conferencing, HD video surveillance, augmented reality, virtual reality, and mixed reality, as shown in Fig. 1.
Intra prediction and inter prediction are powerful tools in video compression. Compared with the quadtree (QT) coding block structure of H.265/HEVC, the quadtree with nested multi-type tree (QT-MTT) of H.266/VVC provides more flexible coding unit (CU) partitioning by using binary tree (BT) and ternary tree (TT) splits. The QT-MTT is a new coding structure in which BT and TT splits form the subtree partition of a QT. Each BT or TT split can be applied in the horizontal or the vertical direction, as indicated in Fig. 2. The input video frame is divided into coding tree units (CTUs) of 128 × 128 pixels under the common test conditions (CTC) [11] of H.266/VVC. In the all-intra configuration, MTT splitting is allowed for CUs no larger than 32 × 32, and the minimum CU size for a BT or TT split is 4 × 4. The MTT coding structure makes CU splitting more diversified and increases the coding efficiency. Fig. 3 shows an example of a QT/MTT partition. However, flexible partitioning has higher computational complexity, which increases the encoding time. In addition, the H.266/VVC encoder is equipped with a considerable number of coding tools. The varied CU partitions and numerous coding tools improve the compression performance for different kinds of video content, but they also tremendously increase the encoding complexity. Accelerating the encoder is therefore essential for reaching real-time processing.
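To make the split geometry concrete, the following minimal sketch lists the child block sizes produced by each split mode. It assumes the usual VVC convention that a BT split halves the block and a TT split divides it in a 1:2:1 ratio; that ratio comes from the standard's design rather than from the text above, and the function name `split_children` is hypothetical.

```python
def split_children(width, height, mode):
    """Return the (width, height) of each child block for one split mode.

    Mode names follow the abbreviations used in this paper:
    QT, BTH, BTV, TTH, TTV. The 1:2:1 ternary ratio is an assumption
    taken from the VVC design, not from the text above.
    """
    if mode == "QT":   # four equal quadrants
        return [(width // 2, height // 2)] * 4
    if mode == "BTH":  # horizontal boundary: two halves stacked vertically
        return [(width, height // 2)] * 2
    if mode == "BTV":  # vertical boundary: two halves side by side
        return [(width // 2, height)] * 2
    if mode == "TTH":  # two horizontal boundaries, heights in ratio 1:2:1
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    if mode == "TTV":  # two vertical boundaries, widths in ratio 1:2:1
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    raise ValueError("unknown split mode: " + mode)


if __name__ == "__main__":
    for m in ("QT", "BTH", "BTV", "TTH", "TTV"):
        print(m, split_children(32, 32, m))
```

The sketch ignores the encoder's size restrictions (for example, the 4 × 4 minimum mentioned above) and only illustrates why five candidate modes per CU multiply the search effort.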

II. RELATED WORKS
To meet the high computational demand of H.266/VVC, fast techniques for H.266/VVC intra coding have been investigated previously. They can be roughly classified into probability-based [12], [13], learning-based [14]-[18], texture-based [19], gradient-based [20], [21], and texture- and gradient-based [22] methods. Fu et al. [12] jointly explore the split types and intra modes of sub-CUs to skip the vertical splits (both vertical BT and TT splits) and the horizontal TT split early. Park and Kang [13] propose a simple early decision technique, motivated by the Bayesian probability approach, to reduce the TT encoding time of intra coding. Peng et al. [14] use support vector machines to build horizontal binary-tree, vertical binary-tree and quadtree decision models for fast QTBT partitioning in H.266/VVC intra coding. Yang et al. [15] select the decision tree as the classifier for the QT/MTT decision and also introduce a fast intra mode decision with gradient descent search. Amestoy et al. [16] use random forest classifiers to determine the most probable partition modes for each coding block. Their three binary classifiers are Split/Non-Split, QT/BT and BH/BV. Since the solution of [16] was originally designed to reduce the complexity of the QT/BT partitioning scheme, it is adapted to MTT partitioning by grouping the horizontal partition modes (BTH and TTH) and the vertical partition modes (BTV and TTV) as the two outputs of the BH/BV classifier. Zhao et al. [17] design a decision tree classifier that uses a just noticeable difference model threshold, motion pattern and image texture features for a fast CU partition decision strategy. Park and Kang [18] use a lightweight neural network model, based on two kinds of features, to decide whether to terminate the nested TT block structures that follow a quadtree.
Peng et al. [19] adopt texture features to classify CUs into simple, common and complex types. They design early skipping of partition modes for simple CUs and of non-partition modes for complex CUs, and exclude unnecessary horizontal or vertical partition modes for common and complex CUs. Tang et al. [20] extract edge features with the Canny edge detector to carry out early termination and to skip vertical or horizontal partition modes. Cui et al. [21] calculate gradients to skip unnecessary searches in the CU size decision and to pre-determine the likelihood of a BT or TT partition in the horizontal or vertical direction. Fan et al. [22] use the variances of the original pixels and gradient features obtained with the Sobel operator to terminate further splitting, to select the QT, and to choose one partition from the five QT/MTT partitions. In this paper, we apply the just noticeable difference model [23] to extract the features fed into a machine learning algorithm for the acceleration of H.266/VVC intra coding.

III. PROPOSED METHOD
The fidelity of the decoded video is affected not only by the objective pixel distortion but also by the range of luminance variation that human eyes can perceive. Gray level variation occurs in both edge and texture regions, so gradient or texture measurements alone may not always be adequate. We need to find the regions that are truly sensitive to human eyes. Several studies on perceptual video coding or processing report promising results [24]-[30].

A. JUST NOTICEABLE DIFFERENCE (JND)
A just noticeable difference (JND) model is employed to quantify the perceptual redundancy; the JND profile derives sensitivity thresholds from the background luminance [23]. A viewer can only just distinguish a difference once the luminance change between the current pixel and the background exceeds the perceptual threshold; below this threshold, differences are imperceptible. The JND visual model represents the visual distortion of luminance, and the curve of visibility thresholds as a function of background luminance [23] is shown in Fig. 4. The horizontal axis indicates the background luminance $pix_{avg}$. Since the minimum size of a CU for a BT or TT split in MTT is 4 × 4, we take the average of the four pixels in the central 2 × 2 region of a 4 × 4 block, as shown in Fig. 5 and (1), to obtain $pix_{avg}$. Each $pix_{avg}$ value corresponds to a JND value $JND_{pix_{avg}}$ on the vertical axis. $JND_{pix_{avg}}$ is calculated by (2), where $T_0$ is 17 and $\gamma$ is 3/128. For a 10-bit video sequence, the gray levels of the image are divided by 4 to obtain an 8-bit image before the JND threshold is computed. If the absolute difference between the luminance of the current pixel and the background luminance is lower than $JND_{pix_{avg}}$, the luminance difference is invisible to human eyes. If the absolute luminance difference is larger than or equal to $JND_{pix_{avg}}$, $VD\text{-}Pix_{i,j}$ marks the pixel as visually distinguishable to human eyes, as shown in (3).
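The thresholding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the piecewise form of the threshold is assumed to be the commonly cited background-luminance model associated with [23] (a square-root branch for dark backgrounds and a linear branch for bright ones, split at luminance 127), and only $T_0 = 17$ and $\gamma = 3/128$ are taken from the text. The function names and the use of integer division for the 10-bit to 8-bit conversion are likewise assumptions.

```python
import math

T0 = 17.0            # value stated in the text
GAMMA = 3.0 / 128.0  # value stated in the text


def jnd_threshold(pix_avg):
    """Visibility threshold for an 8-bit background luminance.

    ASSUMPTION: piecewise form of the widely used background-luminance
    JND model; the exact equation (2) is not reproduced in the text.
    """
    if pix_avg <= 127:
        return T0 * (1.0 - math.sqrt(pix_avg / 127.0)) + 3.0
    return GAMMA * (pix_avg - 127.0) + 3.0


def background_luminance(block4x4):
    """Mean of the four central pixels (rows 1-2, columns 1-2) of a 4x4 block."""
    centre = [block4x4[r][c] for r in (1, 2) for c in (1, 2)]
    return sum(centre) / 4.0


def vd_pixels(block4x4, bit_depth=8):
    """Return a 4x4 map of 0/1 flags; 1 marks a visually distinguishable pixel."""
    shift = bit_depth - 8  # 10-bit input is divided by 4 (assumed integer division)
    block = [[p >> shift for p in row] for row in block4x4]
    bg = background_luminance(block)
    thr = jnd_threshold(bg)
    return [[1 if abs(p - bg) >= thr else 0 for p in row] for row in block]


if __name__ == "__main__":
    block = [[127] * 4 for _ in range(4)]
    block[0][0] = 140  # differs from the background by 13
    block[3][3] = 129  # differs from the background by 2
    print(vd_pixels(block))
```

With a background of 127 the assumed threshold is 3, so the pixel that differs by 13 is flagged and the pixel that differs by 2 is not.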
Inspired by [26], we infer that if a CU contains a visually distinguishable edge in the horizontal direction, BTH and TTH are the more likely splitting modes in the MTT directional split. If it contains a visually distinguishable edge in the vertical direction, BTV and TTV are the more likely splitting modes, as indicated in Fig. 6. Since the visually distinguishable pixels may not lie exactly on a straight line, our method uses projections of these pixels to decide the MTT partition.
When the QtDepth of a CU is equal to 2 or 3 and its MttDepth is equal to 0, our algorithm divides the square CU into four horizontal and four vertical sub-parts and projects the visually distinguishable pixels onto them to obtain their counts, as given by (4)-(5) and illustrated in Fig. 7.
Here, $Proj_{hor,l}$ and $Proj_{ver,l}$ represent the sum of $VD\text{-}Pix_{i,j}$ in the $l$-th sub-part in the horizontal and the vertical directions, respectively, and $(x, y)$ denotes the coordinate of the upper-left pixel of the CU of size $w$ at MttDepth 0. The orientation of the CU can be inferred from the distributions of $Proj_{hor,l}$ and $Proj_{ver,l}$, so that the appropriate partition direction can be determined and unnecessary CU splitting bypassed. Fig. 8 shows ideal cases of the relationship between the projection results and the partition types. Therefore, $Proj_{hor,l}$ and $Proj_{ver,l}$ are used as the inputs of the random forest model.
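The projection step can be sketched as below. This is an illustration under stated assumptions: the horizontal sub-parts are taken to be four horizontal strips of $w/4$ rows each and the vertical sub-parts four vertical strips of $w/4$ columns each, which is one plausible reading of (4)-(5); the function name `projections` and the row-major layout of the flag map are hypothetical.

```python
def projections(vd_map, x, y, w):
    """Count visually distinguishable pixels in four horizontal and four vertical strips.

    vd_map : 2-D list of 0/1 flags indexed as vd_map[row][column]
    (x, y) : column and row of the CU's upper-left pixel
    w      : CU width and height (32 or 16 in the text)

    ASSUMPTION: sub-part l of the horizontal projection is the l-th band of
    w/4 rows, and sub-part l of the vertical projection is the l-th band of
    w/4 columns.
    """
    strip = w // 4
    proj_hor = [0, 0, 0, 0]
    proj_ver = [0, 0, 0, 0]
    for j in range(w):          # row offset inside the CU
        for i in range(w):      # column offset inside the CU
            flag = vd_map[y + j][x + i]
            proj_hor[j // strip] += flag
            proj_ver[i // strip] += flag
    return proj_hor, proj_ver


if __name__ == "__main__":
    # A 16x16 CU whose only distinguishable pixels form a horizontal line on row 5.
    vd = [[0] * 16 for _ in range(16)]
    vd[5] = [1] * 16
    print(projections(vd, 0, 0, 16))
```

For the horizontal line in the usage example, the counts concentrate in a single horizontal strip while the vertical strips are uniform, which is the kind of asymmetry the classifier is expected to exploit.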

B. RANDOM FOREST
Random forest (RF) is an ensemble learning method based on bagging [31]. Its advantage is that it combines multiple basic classifiers to obtain better predictive ability. The basic classifier used in this paper is a decision tree (DT) [32], whose structure is composed of nodes, branches and leaves, as shown in Fig. 9. The sample set in a node is divided into two sub-sample sets by branching.
The steps of setting up nodes and branches are repeated, and the process ends when all samples have been assigned to leaves. The DT algorithm used in this paper is CART (Classification and Regression Trees) [33]. Each branch selects its feature and threshold according to the Gini impurity. Each parent node is divided into two child nodes by comparing the feature value of each sample with the threshold of the parent node. When all samples in a child node belong to the same target category (horizontal or vertical), the node is a leaf. RF generates different DTs by sampling features, and the prediction result is determined by the majority vote of all DTs, as shown in Fig. 10. In this paper, we take nine features, namely $Proj_{hor,l}$ and $Proj_{ver,l}$ ($l$ from 0 to 3) and the quantization parameter (QP), as the input of the random forest model to predict the suitable MTT splitting mode. The QP determines the distortion of the coded video and has an important impact on the bitrate: a higher QP allows more distortion and yields a lower bitrate. Empirical evaluations show that the following hyperparameter setting achieves the best performance. The maximum depth of each decision tree is 11. The number of features considered when looking for the best split is 20. A node containing fewer than 3 samples is not split. The number of trees in the random forest is 51.
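The node-splitting criterion and the voting rule can be illustrated with a small, library-free sketch. It is not the OpenCV implementation used in this paper; the helper names `gini`, `best_split` and `majority_vote`, the midpoint choice of candidate thresholds, and the toy samples are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(samples, labels):
    """Exhaustively pick the (feature, threshold) with the lowest weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct feature
    values (an assumption; CART implementations differ in this detail).
    Returns (impurity, feature_index, threshold), or None if no split exists.
    """
    best = None
    n = len(samples)
    for f in range(len(samples[0])):
        values = sorted(set(s[f] for s in samples))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for s, lab in zip(samples, labels) if s[f] <= thr]
            right = [lab for s, lab in zip(samples, labels) if s[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best


def majority_vote(tree_predictions):
    """Return the label predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy samples: [a horizontal-projection value, a vertical-projection value]
    samples = [[10, 1], [12, 1], [1, 9], [2, 11]]
    labels = ["H", "H", "V", "V"]
    print(best_split(samples, labels))
    print(majority_vote(["H", "V", "H"]))
```

Because the forest in this paper contains an odd number of trees (51) and predicts one of two classes, a majority vote of this kind cannot end in a tie.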

C. MTT DECISION ALGORITHM
Since the MTT partition in VTM 7.0 [34] starts from QtDepth 2 for intra coding, we encode the first 100 frames of each of six benchmark sequences (Class A1 Campfire, Class A2 DaylightRoad2, Class B BQTerrace, Class C PartyScene, Class D BQSquare and Class E KristenAndSara) under QP 22, 27, 32 and 37 to estimate the distribution of the encoding time, as shown in Fig. 11. According to Fig. 11, QtDepth 2 and QtDepth 3 account for approximately 42% and 31% of the encoding time on average, respectively. In other words, the VTM spends most of its coding time at QtDepth 2 and QtDepth 3. If the MTT partition can be determined in advance and the subsequent MTT encoding terminated early, a large amount of encoding time can be saved. A fast MTT decision is therefore desired to accelerate the partitions at QtDepth 2 and QtDepth 3, for which $w$ in (4)-(5) is 32 and 16, respectively.
The training videos consist of the first 100 frames of the 22 sequences of Class A1 to Class E, which are encoded under the all-intra configuration and the CTC. Two random forest classifiers (Classifier-H/V for QtDepth 2 and for QtDepth 3) are trained to decide between a horizontal and a vertical split. OpenCV 3.2.0 is used to train each random forest model individually to predict the MTT partition. The training accuracies of the random forest models for QtDepth 2 and QtDepth 3 are 93.16% and 91.85%, respectively. If the partition for MttDepth 1 is accurately selected, unnecessary splits for MttDepth 1 to 3 can be omitted. Fig. 12 shows the flowchart of the proposed algorithm. During CTU encoding, if MttDepth is 0 and QtDepth is 2 or 3, the JND calculation is performed. The eight projection values of the visually distinguishable pixels, $Proj_{hor,l}$ and $Proj_{ver,l}$ ($l$ from 0 to 3), together with the QP, are input into the trained random forest model to predict the MttDepth 1 partition at QtDepth 2 and QtDepth 3. For other values of QtDepth and MttDepth, the original encoding process is performed. The basic idea of the proposed strategy is to skip the complex MTT partition structure on CUs where it is unnecessary, by using the characteristics of the human vision system and the QP as the input features for training the random forest models.
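The gating logic of the flowchart can be sketched as follows. This is a simplified illustration, not the VTM integration: it assumes that a predicted direction keeps the two MTT modes of that direction and skips the other two, which the text implies but does not state explicitly. The function name `mtt_candidates` and the classifier interface (a callable returning "H" or "V") are hypothetical.

```python
MTT_MODES = ("BTH", "TTH", "BTV", "TTV")


def mtt_candidates(qt_depth, mtt_depth, features, classifiers):
    """Return the MTT split modes still to be tested for the current CU.

    features    : the nine inputs (eight projection counts followed by the QP)
    classifiers : dict mapping a QtDepth (2 or 3) to a callable that returns
                  "H" or "V" (hypothetical interface for a trained model)

    ASSUMPTION: the predicted direction prunes the two modes of the other
    direction; all other depth combinations fall back to the full search.
    """
    if mtt_depth != 0 or qt_depth not in classifiers:
        return list(MTT_MODES)  # original encoding process
    direction = classifiers[qt_depth](features)
    if direction == "H":
        return ["BTH", "TTH"]
    return ["BTV", "TTV"]


if __name__ == "__main__":
    # Stand-in classifiers: compare the spread of the two projection groups.
    def toy(features):
        hor, ver = features[0:4], features[4:8]
        return "H" if max(hor) - min(hor) >= max(ver) - min(ver) else "V"

    models = {2: toy, 3: toy}
    print(mtt_candidates(2, 0, [0, 16, 0, 0, 4, 4, 4, 4, 32], models))
    print(mtt_candidates(1, 0, [0] * 9, models))
```

The toy classifier in the usage example only stands in for a trained random forest so that the control flow can be exercised.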

IV. EXPERIMENTAL RESULTS
The H.266/VVC reference software VTM 7.0 [34] is used to test the performance under the all-intra configuration and the common test conditions (CTC). The quantization parameters are set to 22, 27, 32 and 37. In our experiments, we use the test sequences of six classes (A1, A2, B, C, D and E), which contain natural videos of various resolutions, and skip the first 100 frames used for training, as shown in Table 1. For comparison with [22], we also turn off intra sub-partitions (ISP), matrix weighted intra prediction (MIP) and the low-frequency non-separable transform (LFNST) in the configuration file of VTM 7.0, following the setting in [22].
The Bjøntegaard Delta Bitrate (BDBR) [35] is measured to evaluate the coding performance. The performance of the proposed algorithm is compared with [22]. Time saving (TS) is calculated by (6) and coding gain (CG) is computed by (7).

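$$TS = \frac{1}{4}\sum_{i=1}^{4}\frac{T_{VTM}(QP_i) - T_{Proposed}(QP_i)}{T_{VTM}(QP_i)} \times 100\% \qquad (6)$$
where $T_{VTM}$ denotes the total encoding time of the VTM, $T_{Proposed}$ represents the total encoding time of the proposed method, and $QP_i$ indicates the QP value of the $i$-th coded video bitstream.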
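The time-saving measure can be computed with a few lines of code. The sketch below assumes that (6) averages the relative reduction in encoding time over the four QP values, which is the customary form of this metric; the function name `time_saving` and the timings in the usage example are illustrative only and are not measurements from this paper.

```python
def time_saving(t_vtm, t_proposed):
    """Average relative encoding-time reduction in percent.

    t_vtm, t_proposed : encoding times of the anchor and of the proposed
    encoder, one entry per QP, in the same QP order.

    ASSUMPTION: the metric is the mean over QPs of (T_VTM - T_Proposed) / T_VTM.
    """
    if not t_vtm or len(t_vtm) != len(t_proposed):
        raise ValueError("need one pair of timings per QP")
    ratios = [(a - b) / a for a, b in zip(t_vtm, t_proposed)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Illustrative timings in seconds for QP 22, 27, 32 and 37 (not measured data).
    anchor = [100.0, 200.0, 400.0, 800.0]
    fast = [60.0, 120.0, 240.0, 480.0]
    print(round(time_saving(anchor, fast), 2))
```

Averaging per-QP ratios, rather than dividing summed times, keeps the long low-QP encodes from dominating the result.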
The study in [22] reports good time-saving performance for H.266/VVC intra coding. We compare the TS, BDBR and CG of the proposed method with those of [22] in Table 2. We additionally include UHD sequences (Class A1 and A2) with large resolutions in the experiments. The horizontal and vertical projections of the visually distinguishable pixels are related to the partition types, and the proposed algorithm achieves good results by capturing the visual properties of the video texture. Our method achieves a lower average BDBR (1.14%) than [22] (1.63%), a smaller BDBR variation (standard deviation of 0.39% versus 0.99%), and a smaller TS variation (standard deviation of 1.75% versus 7.76%). The smaller standard deviations across test sequences indicate that our method provides more stable and consistent performance. For Class E sequences, our approach achieves a much lower BDBR than [22]. Regarding coding gain, the 35.59 of the proposed method is superior to the 30.23 of [22]. In general, our method performs better on test sequences with larger resolutions and benefits high-resolution video applications of H.266/VVC. The Class A1 Campfire, Class A2 ParkRunning3 and Class B ParkScene sequences maintain better BDBR than the other sequences within their respective classes. Fig. 13 illustrates the MTT partitions based on human perception obtained by the proposed method for a frame of the Class D BQSquare sequence. The distribution of visually distinguishable pixels benefits the random forest classifiers in deciding the splitting structure. Fig. 14 shows the RD curves of VTM 7.0 and the proposed scheme for the Class A1 Campfire sequence. The RD curve of the proposed algorithm is very close to that of VTM 7.0, while a time saving of 44.33% is achieved by our method. Fig. 15 and Fig. 16 compare the partitions of VTM 7.0 and the proposed algorithm for a frame of the Class B Kimono and Class C RaceHorses sequences, respectively.
The CTU partition results of the two methods are similar, but our proposed approach provides significant time savings. Fig. 17 shows the subjective quality comparison between VTM 7.0 and the proposed algorithm for a frame of the Class A2 ParkRunning3 sequence. The video qualities of the two methods are similar. The structural similarity index measure (SSIM) [36] values are also shown in Fig. 17 to bring the performance comparison as close as possible to human perception. The similar SSIM values of VTM 7.0 and the proposed method indicate the good visual quality of the proposed method. These comparisons demonstrate that the proposed algorithm accelerates the coding process and maintains good video quality.

V. CONCLUSION
The new QT-MTT structure in H.266/VVC consumes a large amount of encoding time. This paper presents a fast algorithm for H.266/VVC intra coding that combines the properties of the human vision system with machine learning. We apply a perceptual model of human eyes, the just noticeable difference, to detect the visually distinguishable pixels in a CU. The horizontal and vertical projections of the visually distinguishable pixels and the QP are taken as the input features of random forest models to predict the MTT partition and omit unnecessary splits. Extensive experiments show that the proposed algorithm effectively reduces the encoding time while maintaining good coding efficiency and subjective video quality.