VVC/H.266 Intra Mode QTMT-Based CU Partition Using CNN

The latest video coding standard is versatile video coding (VVC)/H.266, developed by the joint video exploration team (JVET). Its coding structure is a multi-type tree (MTT), which consists of two tree types: the ternary tree (TT) and the binary tree (BT). Because the encoder performs a brute-force rate-distortion search over the quad-tree plus multi-type tree (QTMT) structure, the coding unit (CU) split decision contributes over 98% of the encoding time. This structure is efficient for coding but increases computational complexity. This paper proposes a deep learning technique that predicts the QTMT-based CU split instead of running the brute-force QTMT search, substantially speeding up the encoding process for VVC/H.266 intra mode. In the first phase, we build an extensive database containing ample CU splitting patterns from a variety of streaming videos. In the second phase, we propose a multi-level exit CNN (MLE-CNN) model with a redundancy removal mechanism at different levels to determine the CU partition. In the third phase, we establish an adaptive loss function for training the MLE-CNN that handles both the variable number of candidate partition modes and the minimization of RD cost. Finally, a variable-threshold decision system is established to reach the targeted trade-off between low complexity and RD performance. Experimental results show that the proposed approach reduces VVC/H.266 encoding time by 47.91% to 69.11% with an insignificant Bjontegaard delta bit rate (BDBR) increase of 1.023% to 2.919%, outperforming existing state-of-the-art approaches.


I. INTRODUCTION
High-resolution video content and applications are now widely available in ultra-high definition (UHD), 4K, and 8K. This continuously growing volume of high-resolution video requires advanced encoding techniques [1]. The moving picture experts group (MPEG) and the video coding experts group (VCEG) established the joint video exploration team (JVET) to work on the most advanced video coding standards such as VVC/H.266 [2]. The latest version of the VVC test model (VTM 8.0) was released by JVET at the start of 2020 [3]. The coding efficiency of VTM is about 40% higher than that of the HEVC/H.265 test model (HM). VVC introduced the quad-tree plus multi-type tree (QTMT), a nested multi-type tree structure.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuihua Wang.
In VVC/H.266, a single coding unit (CU) concept replaces the separate prediction unit (PU), transform unit (TU), and coding unit (CU) of HEVC. In QTMT, a CU may be rectangular. The coding tree unit (CTU) is first partitioned by a quadtree (QT), whose leaf nodes range from 16 × 16 up to the 128 × 128 CTU size. BT and TT cannot divide a 128 × 128 QT node, because their maximum root node size is 64 × 64. A QT leaf node can be divided further by the MTT, with the QT leaf acting as the MTT root; no further division is allowed once the maximum MTT depth is reached. Similar limits apply to the MTT node dimensions: if the width of an MTT node is less than or equal to the minimum BT leaf size, a vertical division is not considered, and the same holds when the width is less than or equal to twice the minimum BT leaf size for a ternary split. Likewise, if the height of an MTT node is less than or equal to the minimum BT leaf size, a horizontal division is not possible, with the analogous double-size restriction for the ternary split. To determine the final CU division, all QT, BT, and TT split paths must be traversed: the rate-distortion (RD) cost is computed at each depth, and the partition mode with the lowest RD cost is selected. As a result, the computational complexity of the QTMT-based CU partition is much higher than in HEVC/H.265. Figure 1 shows how the QTMT architecture partitions the CTU.
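As a concrete illustration of the split-legality rules above, the following sketch enumerates which split modes are available for a CU of size (w, h). The constants (minimum QT leaf, maximum MTT root size, minimum BT leaf, MTT depth limit) are example values chosen to match the description in the text, not normative values from the VVC specification.

```python
# Illustrative sketch of QTMT split legality, NOT the normative VVC logic.
# Constants are assumed example values matching the description above.
MIN_QT_LEAF = 16
MAX_MTT_ROOT = 64     # BT/TT cannot split a 128x128 node
MIN_BT_LEAF = 4
MAX_MTT_DEPTH = 3

def allowed_splits(w, h, mtt_depth=0):
    """Return the split modes legal for a CU of width w and height h."""
    modes = ["non-split"]
    # Quadtree split: square CUs only, down to the minimum QT leaf size.
    if w == h and w > MIN_QT_LEAF:
        modes.append("QT")
    # BT/TT splits: only at or below the MTT root size and depth limit.
    if max(w, h) <= MAX_MTT_ROOT and mtt_depth < MAX_MTT_DEPTH:
        if h > MIN_BT_LEAF:
            modes.append("BT-hor")
        if w > MIN_BT_LEAF:
            modes.append("BT-ver")
        # A ternary split needs room for a 1:2:1 division.
        if h >= 4 * MIN_BT_LEAF:
            modes.append("TT-hor")
        if w >= 4 * MIN_BT_LEAF:
            modes.append("TT-ver")
    return modes
```

For example, a 128 × 128 CTU only admits non-split and QT (since BT/TT roots are capped at 64 × 64), while a 64 × 64 node admits all six modes.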
Moreover, intra prediction changed significantly: in addition to the DC and planar modes, VVC/H.266 provides 65 angular prediction modes, for 67 intra prediction modes in total instead of the 35 modes of HEVC/H.265. Using 65 angular modes yields better prediction results: encoding quality and accuracy improve, but coding time and computational complexity increase as well. Several modern coding tools in the latest video coding standard provide better coding efficiency [2], [4]. Nevertheless, these newly introduced tools also produce high coding complexity and slow down video encoding: under the all-intra configuration of the common test conditions, VTM has about 19 times the coding complexity of HM [5]. Therefore, an operating point has to be found where coding performs well at low complexity.
Several researchers have applied deep learning techniques to reduce the coding complexity of H.265/HEVC. A convolutional neural network (CNN) based fast algorithm for the CU mode decision is proposed in [6] to reduce coding complexity. Similarly, texture-based encoding techniques for HEVC/H.265 are introduced in [7], where a high-speed intra CU partition method using a CNN and texture classification reduces intra coding complexity through a heterogeneous texture feature (HTF). The fast intra coding and fast CU partition methods developed for HEVC/H.265 are instructive for VVC/H.266, but they are not directly applicable because VVC/H.266 uses the QTMT structure.
An early termination procedure using confidence intervals is presented in [8] for high-speed VVC/H.266, detecting redundant QTBT partition modes in the RDO. In [9], a fast CU partition decision algorithm exploits spatial-domain characteristics to reduce the computational complexity of the BT and QT structures in VVC/H.266. Bayesian decisions for rapid partitioning are presented in [10]; the main contribution of that work is exploiting the correlation between a sub-CU and its parent CU, yielding fast VVC/H.266 intra coding. In [11], a fast intra coding algorithm uses an early-skip strategy that removes redundancy in MTT pruning. Another fast intra coding approach is presented in [12] to improve VVC/H.266: fast intra coding is achieved through gradient descent and a fast intra mode decision system, and statistical learning is used to integrate a low-complexity CTU derivation architecture.
In summary, the dominant research directions for H.266/VVC are reducing coding time, increasing coding speed, and minimizing coding complexity at the cost of a small decrease in coding efficiency. Some technologies and tools are developed specifically for H.266/VVC, while tools inherited from H.265/HEVC are being improved; both contribute to the development of H.266/VVC and to its coding efficiency, and computational complexity is reduced on both fronts. The rest of the paper is organized as follows: Section II reviews related work on video coding standards and VVC. Section III discusses the proposed methodology in detail. Section IV describes the CU partitioning process and the QTMT-based CU partition database. Section V presents the proposed MLE-CNN model for complexity minimization of the VVC/H.266 CU partition. Section VI presents the experimental results, verifying the accuracy and novelty of our approach. Section VII concludes the paper.

II. RELATED WORK
Many techniques have been used to improve the partition of the coding block for HEVC/H.265 and its predecessors.

A. VIDEO CODING STANDARDS
VP9, AV1, AVS2, and HEVC [13]-[16] were the main video coding standards before VVC. The joint collaborative team on video coding (JCT-VC) developed the HEVC international video coding standard, which became a research focus. HEVC simplifies the partition of the coding tree unit (CTU), and approaches to accelerating it can be divided into two groups: heuristic and data-driven.
Heuristic techniques build a statistical model from features gathered during encoding. With this model, the brute-force RDO search over CTU partitions can be simplified and repeated checks in the CTU partition can be skipped. The CU partition accounts for a large share of HEVC encoding time, so several researchers propose early CU partition decisions [10], [17]-[21]. Data-driven methods, by contrast, significantly reduce the computational complexity of HEVC by learning features automatically instead of handcrafting them. The computational complexity of the CTU partition is significantly reduced by CNNs [12], [22]: a CNN-based decision on the CTU structure simplifies the encoding process, intra mode prediction has been formulated as multi-class classification, and the CNN layers predict the most suitable mode. An early-terminated hierarchical CNN is used for the CU partition output [7]. Hence, these methods reduce complexity for HEVC.
Compared with data-driven techniques, heuristic methods are less accurate as far as CTU partition accuracy is concerned, and high prediction accuracy benefits both complexity and RD performance. Both kinds of methods reduce complexity by learning the partition block structure across the different tree types (quadtree, ternary tree, and binary tree). In our proposed technique, however, redundant features are managed internally and no extra information needs to be transmitted.

B. VERSATILE VIDEO CODING-BASED STANDARDS
The VVC/H.266 CU partition is more flexible due to the QTMT and QTBT structures. The complexity of VVC/H.266 is even higher than that of HEVC, but it can be reduced with data-driven and heuristic methods. QTBT structure-based work is shown in [17]-[19]. A fast method for QTBT encoding is proposed in [17], where a temporal frame index exploits the full binary tree path. A joint classifier is used to propose a fast QTBT decision method [20]. For early termination of QTBT partitioning, a random forest is used in [21], which stops redundant iterations; the reduced-complexity CU partition eliminates unnecessary intra prediction and partition modes. Similarly, deep-learning-based algorithms for improved VVC/H.266 coding are described in [17]-[19]. Bottom-up decision methods are proposed in [17] and [18] to improve VVC/H.266 intra coding using a multi-class classification depth-range model.
Following the data-driven approach, [10] proposes to predict the CU partition depth of every 32 × 32 CU using a CNN, eliminating the RDO search for intra mode CUs. A different way of applying a CNN to predict the CU partition depth is used at inter mode: the CNN input is the residual CU, because partitions are correlated across frames. A ResNet operating on 4 × 4 blocks is used to predict all CU borders directly and to predict the CU partition accurately, thereby reducing the complexity of VVC [12]. However, the bottom-up decision produces extra computation when a large CTU yields a mix of significant CU splits and non-splits.

III. PROPOSED METHODOLOGY
The QTMT structure is used to split the coding unit, and 67 intra prediction modes are used for intra prediction. For fast and accurate prediction, our approach uses a CNN to predict the intra mode CU partition for VVC/H.266. The proposed work differs from existing research in three aspects.
Firstly, we propose a novel N-QTMT structure that is faster and more accurate than existing approaches. In prior data-driven work, only QTBT was targeted for the CU partition [10], [12], [22].
Secondly, we use a deep learning approach to extract features from video frames automatically, instead of the handcrafted feature extraction used in [17], [18], and [19].
Lastly, we design a multi-level CNN. It predicts large CU partitions in its earlier layers and small CUs in its later layers. This is better than the bottom-up decision approach based on CU boundaries: the multi-level CNN avoids redundancies through its early-exit approach.

IV. WORKING OF CU PARTITION WITH DATASET

A. CU PARTITION DESCRIPTION
This section briefly describes the CU partition process in VVC/H.266, which is quite different from and more flexible than HEVC. In the HEVC standard, a CTU contains one CU or is repeatedly divided into smaller square CUs using a quadtree; the default CTU size is 64 × 64 pixels, and the minimum CU size is 8 × 8. The VVC standard, in contrast, allows many flexible CU partitions. As far as the CU partition is concerned, N-QTMT is derived from QTBT. Moreover, the N-QTMT structure can divide a CU into both square and rectangular forms, so CUs can adapt to more complex texture features in a video frame. Using the quadtree, the N-QTMT structure divides the CTU into one CU or smaller CUs to capture fine details; small CUs can be divided further by the quadtree or by the multi-type tree. The multi-type tree contains binary and ternary trees in horizontal and vertical modes; an example is shown in Figure 2. CU sizes within a CTU range from 4 × 4 at the minimum to 128 × 128 at the maximum. Furthermore, the intra mode CU partition is performed individually for the luminance and chrominance channels. A multi-level hierarchical partition method is used to obtain the CU splits from the earlier features. Figure 2 shows the splitting of a 128 × 128 CU into 64 × 64 CUs as the level-1 process; after level 1, the 64 × 64 CUs are further split into 32 × 32 CUs at level 2, and so on. Levels 1 and 2 support only the quadtree and non-splitting modes, whereas a maximum of six modes (non-splitting, quadtree, horizontal binary tree, vertical binary tree, horizontal ternary tree, and vertical ternary tree) are possible at the subsequent levels; the minimum CU width or height is 4. The possible CU sizes and split modes at each level are illustrated in Figure 2.
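The six-level hierarchy above can be sketched as a small helper that maps a level to its candidate split modes and entering CU size. This is an illustrative reading of the text (levels 1 and 2 restricted to non-split and quadtree), not code from the paper.

```python
# Sketch of the six-level hierarchical partition described above.
# Mode names follow the paper's list of six split modes.
ALL_MODES = ["non-split", "QT", "BT-hor", "BT-ver", "TT-hor", "TT-ver"]

def candidate_modes(level):
    """Candidate split modes at a partition level (1..6): levels 1-2
    only choose between non-split and quadtree; deeper levels may use
    all six modes."""
    assert 1 <= level <= 6
    if level <= 2:
        return ALL_MODES[:2]
    return ALL_MODES

def cu_size_at_level(level):
    """Square CU edge length entering each level: 128, 64, 32, 16, 8, 4."""
    return 128 >> (level - 1)
```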
In summary, the QTMT used in VVC/H.266 is more advanced and flexible in terms of CU sizes and types than HEVC/H.265.

B. PREPARATION OF DATASET
We created a database to train and evaluate our model for the intra mode CU partition. The data comprise approximately 9000 images [17] and 300 video sequences in total [23]-[28]. The collected data cover different resolutions and a variety of contents and are openly available for research purposes. The data are split into three distinct sets for training, testing, and validation: 7200 images and 240 video sequences are used for training, 900 images and 30 video sequences for testing, and 900 images and 30 video sequences for validation. A short description of the data is listed in Table 1. The VVC/H.266 reference software VTM 7.0 is used to encode the images and videos. The image resolutions were prepared as multiples of 8 because of VTM support, and the video sequences are no longer than 10 seconds. Four quantization parameters, QP 22, 27, 32, and 37, are used to encode the images and video sequences.
After encoding, CU partition labels are obtained for the dataset. The labels represent the ground-truth CU split mode; each CU can take one of six split modes: mode 0 is non-split, mode 1 is quadtree, mode 2 is horizontal binary tree, mode 3 is vertical binary tree, mode 4 is horizontal ternary tree, and mode 5 is vertical ternary tree. For model training, we also saved the RD cost of every CU mode and used it for RD optimization in VVC/H.266.
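The label encoding described above can be written out as a simple mapping, which we use below only for illustration:

```python
# Ground-truth label encoding for the six split modes, as described above.
SPLIT_MODE_LABELS = {
    0: "non-split",
    1: "quadtree",
    2: "binary-tree horizontal",
    3: "binary-tree vertical",
    4: "ternary-tree horizontal",
    5: "ternary-tree vertical",
}

def mode_name(label):
    """Decode an integer label (0..5) to its split-mode name."""
    return SPLIT_MODE_LABELS[label]
```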
Each CTU's partition label, together with the RD cost of every CU, forms a specimen in our database. Details of the total samples and the total number of CUs in the database are given in Table 1. As the table shows, the dataset includes 6,809,233 specimens with over 1 billion CUs in total, providing adequate data for training our MLE-CNN model. Table 1 describes the database in terms of the number of images per resolution, the total number of CTUs, and the number of CUs.
The various split modes that constitute the detailed CU proportions are presented in Figure 2, which illustrates how the candidate split modes depend on the specific CU: their number varies from 2 to 6 under the partition rules already discussed in the previous subsection, ''working of CU partition.'' The split-mode proportions are unbalanced, particularly for the ternary-tree divisions (modes 4 and 5), while the non-split mode 0 is a dominant class for many CU sizes. Simple image classification is therefore easy compared with the multi-level CU partition, which has unbalanced classes and no single output. To deal with this problem, we focus on the design of the CNN model.

V. COMPLEXITY MINIMIZATION FOR INTRA MODE OF VVC/H.266

A. CU PARTITION LEARNING USING MLE-CNN
This section describes the MLE-CNN learning mechanism for the QTMT-based CU partition in VVC/H.266. In a standard VVC/H.266 encoder, all viable CUs in a CTU must be checked from the bottom up using the brute-force RDO search. Instead, we propose to predict the CU partition with the MLE-CNN in a top-down, stepwise manner, resulting in a fast encoding process. The overall structure of the MLE-CNN is illustrated in the next subsection.
The MLE-CNN takes the 128 × 128 luminance channel of a CTU as input and applies convolutional layers to obtain 128 × 128 feature maps. From these extracted feature maps, up to six decision units predict the split modes for the six CU partition levels. Several consecutive convolution layers, through which the feature maps flow, extract texture features; we call these layers conditional convolutions. A small network then takes the feature maps as input and predicts the first CU partition. If the prediction result is non-split, the CU partition terminates early at the current stage; otherwise, the next stage receives the feature maps together with the location of every divided CU. The conditional convolution and the small networks are described below.
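The early-exit control flow just described can be sketched as follows. The `cond_conv` and `sub_network` callables stand in for the real MLE-CNN components (conditional convolutions and the per-level small networks), which are only placeholders here:

```python
# Minimal sketch of the multi-level early-exit control flow described
# above. `cond_conv` and `sub_network` are placeholder callables, not
# the paper's actual network components.
def predict_partition(features, cond_conv, sub_network, max_level=6):
    """Walk the levels top-down, exiting early on a non-split decision."""
    decisions = []
    for level in range(1, max_level + 1):
        features = cond_conv(features, level)   # conditional convolution
        mode = sub_network(features, level)     # small-network decision
        decisions.append((level, mode))
        if mode == "non-split":                 # early exit: stop descending
            break
    return decisions
```

With stub components that predict a quadtree split at levels 1-2 and non-split at level 3, the walk stops after three levels.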

1) CONDITIONAL CONVOLUTION
A neural network works well if it is deep and trained on many features, which is why we use the deep MLE-CNN and extract rich texture features. The structure of our network is flexible because it depends on the CU size, which can differ at each level; the deep extracted features are what we use. Figure 4(b) shows the efficient ResNet model. If ω × h is the CU size, then min(ω, h) is the minimum axis length of the CU, which determines the granularity of the CU partition. The processed input feature maps pass through a variable number of residual units η_r ∈ {0, 1, 2}, chosen according to the minimum axis length of the current CU. Each residual unit contains convolution operations with stride 1, overlapping, and zero padding, so the feature map size is not altered; the sub-network then takes the processed feature maps from the residual units as input. This flexible design gives the MLE-CNN a unique property. The residual units are indexed by κ ∈ {1, 2, 3, 4, 5, 6}, defined for a given CU as κ = log₂(256 / min(ω, h)). For training, the inputs to the small networks must be consistent: CUs with the same value of the index κ share the same features across all residual units.
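The index formula above is simple to check numerically. This sketch computes κ = log₂(256 / min(ω, h)) for power-of-two CU sizes from 128 down to 4:

```python
import math

def residual_unit_index(w, h):
    """kappa = log2(256 / min(w, h)); for CU axis lengths from 128 down
    to 4 this yields kappa in {1, ..., 6}, so CUs with the same minimum
    axis share the same residual-unit index (and hence features)."""
    k = int(math.log2(256 / min(w, h)))
    assert 1 <= k <= 6
    return k
```

For example, a 128 × 128 CTU maps to κ = 1 and a 4 × 8 CU maps to κ = 6.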

2) SMALL-NETWORKS
In every sub-network, the CU to partition may be 64 × 64 or smaller; fully connected layers and convolutions process the feature maps to predict where to split. The CU size and the sub-network configuration are closely related, as shown clearly in Figure 4(b). In each sub-network, two or three convolution layers process the input feature maps to capture detailed features for the CU partition. The kernel height and width of each convolution layer are powers of 2, and they equal the two dimensions of the kernel stride, so the kernels do not overlap. This non-overlapping convolution is adapted to the non-overlapping CUs, in both size and location, of the final partition. The convolution layers produce output feature maps from which the split mode is obtained; the length of the output prediction vector depends on the CU size and ranges from 2 to 6. The QP also plays an essential role in the CU partition: if it is decreased, the split tendency increases, and vice versa. The QP is supplemented as an external feature after the first fully connected layer. To account for QP-related features in the MLE-CNN, we use a half-mask operation: half of the extracted feature maps are multiplied by the normalized QP value. In this way, the MLE-CNN can learn the CU partition at different QP values. In a nutshell, the output of the small network controls the CU partition procedure: if a non-split CU is predicted, the exit procedure is executed at the current stage; otherwise, the next stage is processed and no exit occurs.
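The QP half-mask operation can be sketched in a few lines of NumPy. The normalization of QP by 51 (the maximum QP value in VVC) is our assumption for illustration; the paper does not state the exact normalization:

```python
import numpy as np

# Sketch of the QP half-mask: half of the feature-map channels are
# scaled by the normalized QP value so the network can adapt its split
# tendency to the quantization parameter. Normalizing by 51 (the max
# QP in VVC) is an assumed choice for illustration.
def qp_half_mask(feature_maps, qp, qp_max=51.0):
    """feature_maps: array of shape (channels, H, W); returns a copy
    with the first half of the channels scaled by qp / qp_max."""
    out = feature_maps.copy()
    half = feature_maps.shape[0] // 2
    out[:half] *= qp / qp_max
    return out
```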
Our approach combines the ResNet model with the small-network operations in a multi-level design. It is an effective model for the QTMT-based CU partition of the VVC/H.266 structure. The design significantly reduces computational complexity because a branch of the MLE-CNN exits as soon as it predicts non-split, simply skipping the redundancies. The experimental results in Section VI verify and explain this.

B. MLE-CNN TRAINING AND LOSS FUNCTION VALUES
The proposed model differs from typical classification designs for three reasons: 1) the set of candidate split modes is size-dependent, ranging from 2 to 6 (details in Section IV); 2) the exit strategy is flexible, so the exit level is not fixed (details in Figure 3); and 3) the loss cannot be a cross-entropy function alone.
In VVC/H.266, every split mode has a different RD cost; therefore, the loss function of the MLE-CNN must be adapted to the properties mentioned above. For a CU of width ω and height h, the set of possible split modes is denoted M(ω, h); each element m of M(ω, h) indexes a possible split mode, with m ranging from 0 to 5. We train on mini-batches whose CUs all share the same size. Let n be the CU index and N the batch size. The cross-entropy loss is

L_CE = −(1/N) Σ_{n=1}^{N} Σ_{m∈M(ω,h)} y_{m,n} log ŷ_{m,n},   (2)
where y_{m,n} is the binary ground-truth label and ŷ_{m,n} the predicted probability of split mode m for the n-th CU. Because the split-mode proportions are unbalanced, different penalty weights are applied according to the imbalance, changing the cross-entropy to

L_CE = −(1/N) Σ_{n=1}^{N} Σ_{m∈M(ω,h)} (p_m)^{−α} y_{m,n} log ŷ_{m,n},   (3)
where p_m denotes the quantitative proportion of CUs with split mode m, satisfying Σ_{m∈M} p_m = 1, and α ∈ [0, 1] is a scalar controlling the penalty weights. If α = 0, no penalty related to p_m is applied. If α = 1, the penalty weights are proportional to the inverse of p_m; in that case the MLE-CNN is hard to train properly, because the prior distribution of split modes is difficult to learn under this setting, and poor prediction accuracy is observed. In practice, α ∈ (0, 1) balances prediction reliability and accuracy; in our case α = 0.4, tuned on our validation set. Parameter tuning is explained in detail in the experiments section.
Equation (3) addresses the first two properties; the third property is handled by the RD loss function

L_RD = (1/N) Σ_{n=1}^{N} Σ_{m∈M(ω,h)} ŷ_{m,n} (r_{m,n}/r_{n,min} − 1),   (4)
where r_{m,n} is the RD cost of the n-th CU at split mode m, and r_{n,min} is the least RD cost of the current CU over all possible split modes. The term (r_{m,n}/r_{n,min} − 1) is the normalized RD cost, so the penalty increases when predicted probability mass is placed on inaccurate modes. Adding equations (3) and (4), the total loss for the MLE-CNN is

L = L_CE + β · L_RD,   (5)

where the positive scalar β controls the importance of the RD cost. The MLE-CNN is trained by minimizing L.
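The adaptive loss can be sketched in NumPy as a penalty-weighted cross-entropy term plus an RD-cost term. The exact weighting scheme below, (p_m)^(−α), is our reading of the description; treat it as an illustrative implementation, not the paper's code:

```python
import numpy as np

# Hedged sketch of the adaptive loss described above. y_true/y_pred have
# shape (N, M); p holds the prior proportion of each split mode; r holds
# the per-mode RD cost of each CU.
def mle_cnn_loss(y_true, y_pred, p, r, alpha=0.4, beta=1.0, eps=1e-12):
    # Penalty-weighted cross-entropy: rarer modes get larger weights.
    weights = p ** (-alpha)
    ce = -np.mean(np.sum(weights * y_true * np.log(y_pred + eps), axis=1))
    # RD term: penalize probability mass on modes with higher RD cost
    # than the cheapest mode of each CU.
    r_min = r.min(axis=1, keepdims=True)
    rd = np.mean(np.sum(y_pred * (r / r_min - 1.0), axis=1))
    return ce + beta * rd
```

With a perfect prediction and equal RD costs, both terms vanish; an uncertain prediction yields a positive loss.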

C. MLE-CNN DECISION APPROACH
Ideally, the MLE-CNN removes redundant CU checks from the original RDO process, reducing encoding complexity, and predicts the complete CU partition. However, the proposed model can also make wrong predictions, and a wrongly predicted CU partition degrades RD performance. We therefore propose a variable-threshold decision method to find the best balance between RD performance and encoding complexity. In our method, we apply a set of thresholds {τ_s}_{s=2}^{6}, each taking values in [0, 1], to the levels of the MLE-CNN, where s denotes the level index. Recall the predicted probability ŷ_{m,n} of split mode m for CU n in the mini-batch. At level 1, the candidate mode is already deterministic for the VTM encoder, so the MLE-CNN need not predict it; the variable thresholds apply from the second level onward. Let ŷ_{n,max} = max_{m∈M} ŷ_{m,n} denote the maximum predicted probability. The candidate modes are those m ∈ M with ŷ_{m,n} ≥ τ_s · ŷ_{n,max}; these are checked by the encoder's RDO, and the remaining modes are ignored. The threshold τ_s thus controls how much confidence is placed in the MLE-CNN prediction. With τ_s = 1, only the single most probable mode is selected for the RDO process, yielding the minimum encoding complexity; if that prediction is wrong, poor RD performance results. Conversely, with τ_s = 0, all candidate modes are checked by the RDO process, so the RD performance does not degrade but less complexity is saved. In our experiments, the threshold values are set between 0 and 1. Next, we propose the level-wise selection of {τ_s}_{s=2}^{6} in the MLE-CNN, since each level can have a different prediction accuracy.
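The candidate-selection rule above can be sketched directly. A sketch under the stated rule ŷ_{m,n} ≥ τ_s · ŷ_{n,max}:

```python
# Sketch of the variable-threshold decision rule: only split modes whose
# predicted probability reaches tau times the maximum probability are
# passed to the RDO search. tau = 1 keeps a single mode (fastest);
# tau = 0 keeps every mode (best RD performance).
def rdo_candidates(probs, tau):
    """probs: dict mode -> predicted probability; returns the modes to
    be checked by RDO."""
    y_max = max(probs.values())
    return [m for m, y in probs.items() if y >= tau * y_max]
```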
Figure 5 illustrates the prediction accuracies of the MLE-CNN for different threshold values τ_s; details are given in Section V. The split size varies with the level and CU size, and the MLE-CNN handles multi-class classification at every level. The top half of the figure shows that level 2 attains the best accuracy, and level 6 achieves the next-best accuracy; its performance is strong for large threshold values and remains good for τ_s near zero. For the remaining levels, the accuracies are unequal and lower.
Accordingly, strategies are needed to pick the variable thresholds so as to ensure the prediction accuracy of the MLE-CNN.

VI. RESULTS OF EXPERIMENTS
This section evaluates the proposed approach, showing how it reduces the complexity of VVC/H.266 intra mode and assessing the performance of the designed model. Section VI-A describes the experimental settings. Section VI-B evaluates RD performance and complexity against the state-of-the-art works in [10], [12], and [22]. Section VI-C analyzes the time consumption of the model, and Section VI-D gives concluding remarks on the conducted study.

A. SETTINGS FOR EXPERIMENTS
All experiments to reduce VVC/H.266 complexity were implemented on the VTM 7.0 reference software. The evaluation is based on the 900 test images and 60 test video sequences from our database. The four QP values 22, 27, 32, and 37 are used, and the video clips and images are encoded under the all-intra (AI) configuration. We compare the encoding time against the original VTM software and record the encoding times to calculate the complexity decrease. Following [29], RD performance is measured with the Bjontegaard delta PSNR (BD-PSNR) and the Bjontegaard delta bit rate (BDBR). All experiments were run on an Intel Xeon E5-2680 v4 CPU at 2.40 GHz with 256 GB RAM under 64-bit Linux Ubuntu 16.04. For training we used an NVIDIA GeForce RTX 3060 Ti GPU, but the encoding performance tests were performed without the GPU so that encoding performance could be compared fairly.
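For reference, the BDBR metric used above can be computed as follows. This sketch follows the commonly used Bjontegaard formulation (third-order polynomial fit of log-rate versus PSNR, integrated over the overlapping PSNR range); it is not code from the paper or from [29]:

```python
import numpy as np

# Hedged sketch of the Bjontegaard delta bit-rate (BDBR) computation:
# fit log10(bitrate) as a cubic polynomial of PSNR for both codecs,
# integrate over the overlapping PSNR range, and convert the mean
# log-rate difference to a percentage.
def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    la, lt = np.log10(rate_anchor), np.log10(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (10 ** avg_diff - 1.0) * 100.0   # percent bit-rate change
```

Identical RD curves give a BD-rate of 0%, and doubling the bitrate at every PSNR point gives +100%.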

1) MLE-CNN SETTING
In the VVC/H.266 standard, the CU partitions of the chrominance and luminance channels are computed independently, so separate MLE-CNN models are trained for the individual color channels. In total, we trained 19 MLE-CNN models covering the color channels and CU sizes. Figure 6 shows the order of the various MLE-CNN models and the trainable elements of each model. Rectangular CUs are fed to the model with their height less than their width.
If a CU has width less than height, its partition pattern and content are transposed. All hyperparameters used to train the MLE-CNN were calibrated on the validation set of our database [10]. For the loss function of the MLE-CNN, the two weights were set to 0.3 and 1.0, respectively. Initially, all biases and weights are randomized. Every model is trained from scratch with a batch size of 36 for 500,000 iterations. The learning rate starts at 10⁻³ and is then reduced exponentially by 1% every 2200 iterations. The trainable parameters were fine-tuned with the Adam algorithm [12] without changing the other parameters. When the MLE-CNN enters the inference phase, the variable threshold values are selected as described in Section V-C and listed in Table 2, which shows the threshold values for fast mode and medium mode: the average value is 0.5 in fast mode and 0.3 in medium mode.
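The learning-rate schedule stated above (start at 10⁻³, decay by 1% every 2200 iterations) amounts to an exponential staircase, sketched here for illustration:

```python
# Sketch of the stated learning-rate schedule: base rate 1e-3, decayed
# by 1% (multiplied by 0.99) after every 2200 iterations.
def learning_rate(iteration, base=1e-3, decay=0.99, step=2200):
    """Exponential staircase: base * decay^(iteration // step)."""
    return base * decay ** (iteration // step)
```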

B. EVALUATION OF PERFORMANCE
We compare the performance of our MLE-CNN model in terms of coding efficiency and complexity decrease with the state-of-the-art models [10], [12], and [22]. Like ours, the models in [10] and [12] use the QTMT-based CU partition for VVC/H.266. The test results are listed in Table 3 and Table 4, based on 60 video sequences and 900 images, respectively. Table 3 demonstrates that our method reduces the encoding time of a video sequence by 59.57% to 69.11%, a considerable reduction, compared with 55.65% to 59% for [10], 52.48% to 64.44% for [12], and 38.19% to 41.79% for [22].
Regarding RD performance, our proposed method achieves the minimum BDBR increase of 1.023%, and the average BD-PSNR loss is 0.055 dB. These values are better than those of the advanced models [10], [12], and [22], and our results are also comparable with [30] and [31]. In terms of the BD-PSNR and BDBR metrics, our model is also faster than [10], [12], and [22]. The MLE-CNN performs best in RD performance and complexity on the video clips, thanks to its data-driven approach, which gives high accuracy, its direct prediction method, and the removal of redundancies from the RDO search. The image results in Table 4 show the same trend.
Figure 7 presents a more comprehensive analysis of the RD-complexity trade-off of the compared methods at four QP points, using the variable threshold values of our method described in Section V(A). For both images and videos, the curve of our approach lies at the bottom left of all the other curves: at an equal BDBR value, our model consumes less encoding time, and at an equal encoding time, it achieves better RD performance. Consequently, the effectiveness of our proposed approach is verified.

C. MLE-CNN TIME CONSUMPTION ANALYSIS
To improve the encoding efficiency of VVC/H.266, our model itself must consume little computation time; otherwise, it would create overhead. We therefore analyze the running time of MLE-CNN in depth and compare it with the VTM 7.0 encoder. Figure 8 depicts the ratio of the time consumed by MLE-CNN to the total encoding time, averaged over all test videos and images at the same four QP points. Compared with the original VTM, MLE-CNN introduces an overhead of 5% or less for most resolutions. As shown in Figure 8, the average overhead is 3.02% for images and 3.67% for videos, only a small fraction of the overall encoding time. This low overhead is achieved by removing redundancies from the QTMT-based CU partition search. Overall, the encoding time is reduced by 64.53% on average in fast mode and 45.96% in medium mode, as verified in Section V(B).
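The overhead figures above are simple time ratios; a minimal sketch (our own helper, not code from the paper) of how such a ratio could be averaged over the measurement points, e.g. the four QPs:

```python
def average_overhead(cnn_times, total_times):
    # Percentage of total encoding time spent in MLE-CNN inference,
    # averaged over the paired measurements (one per QP point).
    ratios = [100.0 * c / t for c, t in zip(cnn_times, total_times)]
    return sum(ratios) / len(ratios)
```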

D. INVESTIGATION OF THE STUDY
This section examines the proposed MLE-CNN to judge the contribution and effectiveness of its individual components; Table 5 reports the results. First, we tested a single-level exit CNN against the multi-level one, without considering the RD cost in the loss function, i.e., with β = 0 as in Section V(B), and with the variable threshold values held constant across levels for fast mode. Then, the multi-level structure, the RD cost, and the variable threshold were added consecutively. In Table 5, categories 1, 2, 3, and 4 correspond to the multi-level structure, the RD cost, the variable threshold, and fast mode, respectively.

1) SINGLE LEVEL/MULTILEVEL CNN ARCHITECTURE
The single-level exit CNN (SLE-CNN) differs from our proposed MLE-CNN in that all of its networks take only the level-2 feature maps (conditional convolution) as input, whereas MLE-CNN uses feature maps from multiple levels. Specifically, the first layer of every residual unit of SLE-CNN is extended from 16 to 48 channels, so that SLE-CNN and MLE-CNN have equal numbers of trainable parameters. We compare SLE-CNN with MLE-CNN as categories 1 and 2. Table 5 clearly shows the superior coding efficiency of MLE-CNN over SLE-CNN: a BD-PSNR gain of 0.150 dB and a BDBR saving of 2.968%.
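The multi-level exit idea can be illustrated schematically. The following is a Python sketch under our own assumptions (callable per-level sub-networks and classifiers), not the paper's implementation: inference stops at the first level whose prediction clears that level's confidence threshold, skipping the deeper and more expensive levels.

```python
def multi_level_inference(extractors, classifiers, x, thresholds):
    # Each level refines the features and proposes a split mode; as soon
    # as one level's top probability clears its threshold, exit early.
    for level, (extract, classify) in enumerate(zip(extractors, classifiers)):
        x = extract(x)
        probs = classify(x)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= thresholds[level]:
            return level, best  # early exit at this level
    return level, best          # fall back to the deepest level's decision
```

A single-level variant, by contrast, always pays the full cost of its one (wider) network before any decision is made.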

2) RD COST WITH AND WITHOUT LOSS FUNCTION
We include the RD cost in the loss function so that the split modes are well organized during the training of MLE-CNN. We compare the loss function with and without the RD cost, shown as categories 2 and 3 in Table 5. Including the RD cost reduces BDBR by 0.243% and improves BD-PSNR by 0.008 dB. Moreover, over the four QP points, the encoding-time saving increases from 0.85% to 2.82%.
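As a rough illustration of an RD-aware loss (our own simplified formulation; the paper's exact loss may differ), the cross-entropy on the ground-truth split mode can be augmented with a penalty that grows when probability mass is placed on modes whose RD cost exceeds that of the cheapest mode:

```python
import math

def rd_weighted_loss(probs, label, rd_costs, beta=1.0):
    # Standard cross-entropy on the ground-truth split mode ...
    ce = -math.log(probs[label])
    # ... plus a penalty proportional to the probability mass assigned
    # to modes with a higher RD cost than the cheapest mode.
    min_rd = min(rd_costs)
    rd_term = sum(p * (rd / min_rd - 1.0) for p, rd in zip(probs, rd_costs))
    return ce + beta * rd_term
```

With beta = 0 this reduces to plain cross-entropy, matching the β = 0 ablation setting.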

3) VARIABLE THRESHOLD EVEN/UNEVEN TO LEVEL
In the MLE-CNN implementation, the variable threshold takes different values at the different levels of the CU partition, so the prediction accuracy at each level is adjustable. To evaluate this, categories 3 and 4 in Table 5 compare uniform and non-uniform threshold values. For a fair comparison, both settings use an average threshold value of 0.5 across all levels. In Table 5, category 4 outperforms category 3: BDBR is reduced by 0.140% and BD-PSNR increases by 0.007 dB at comparable encoding time. In summary, from category 1 to category 4, complexity reduction and RD performance are continuously improved, which shows that the flexible variable threshold and the RD cost in the loss function both benefit our proposed MLE-CNN.
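The per-level thresholding can be sketched as a mode filter (an illustrative helper under our assumptions, not the paper's code): only split modes whose predicted probability clears that level's threshold are passed on to the RDO check.

```python
def candidate_modes(probs, level, thresholds):
    # Keep the split modes whose predicted probability reaches the
    # threshold assigned to this partition level; RDO then evaluates
    # only the survivors. If none survive, keep the single best mode.
    th = thresholds[level]
    kept = [m for m, p in enumerate(probs) if p >= th]
    return kept or [max(range(len(probs)), key=probs.__getitem__)]
```

Lower thresholds keep more candidates (better RD, slower encoding); level-dependent values let this accuracy-speed balance differ per level, which is what the uniform vs. non-uniform comparison above probes.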

VII. CONCLUSION
In this paper, we have proposed a deep-learning approach to predict the QTMT-based CU partition, making the intra mode of VVC/H.266 encoding fast and accurate. The CU partition of VVC/H.266 is far more flexible than that of HEVC/H.265. We first built an extensive database of diverse CU partition patterns and then examined the split modes of CUs at multiple levels. Next, we proposed a deep model, MLE-CNN, which combines small sub-networks with conditional convolution to exploit the full power of the network. The MLE-CNN incorporates an early-exit mechanism that skips the checking of redundant CU split modes. In addition, a variable threshold decision system was established, which achieves a good trade-off between encoding complexity and RD performance.
The experimental results show that our proposed approach saves 47.91% to 69.11% of the encoding time on average, with an insignificant BDBR increase of 1.023% to 2.919% on video, surpassing the existing state-of-the-art works. In the near future, the deep-learning approach could also be used to reduce the encoding time of the inter mode of VVC/H.266. Deep and convolutional neural networks have great potential to optimize other parts of VVC/H.266 as well, for example, the inter mode and the intra angular mode selection that accompany the CU partition. Furthermore, advanced implementations on networks or field-programmable gate arrays could specifically accelerate the CU partition techniques. These are promising directions for building fast VVC/H.266 encoders.