Fast CU Decision-Making Algorithm Based on DenseNet Network for VVC

The Joint Video Experts Team (JVET) is currently developing a new video coding standard called H.266/Versatile Video Coding (VVC). Compared with High Efficiency Video Coding (HEVC), VVC adds a variety of coding tools. These tools greatly improve video compression efficiency while maintaining high video quality. However, because of the increased computational complexity, the encoding time is much longer than that of HEVC. We propose a prediction tool based on DenseNet (a convolutional neural network) to decrease the VVC coding complexity. A Convolutional Neural Network (CNN) predicts the probability that each edge of the 4 × 4 blocks in each 64 × 64 block is a division boundary. Then, using the resulting probability vector, we skip unnecessary rate distortion optimization (RDO) in advance and speed up the coding. The proposed method reduces the coding complexity of VTM10.0 intra coding by 46.10%, while the Bjøntegaard delta bit rate (BDBR) increases by only 1.86%. For sequences with a resolution greater than 1080P, the complexity reduction reaches 64.81%, with a BDBR increase of only 1.92%.


I. INTRODUCTION
Due to the increase in IP video traffic in recent years [1] and the emergence of new video formats such as 4K, 8K, High Frame Rate (HFR), Wide Color Gamut (WCG) and VR video, the demand for video transmission bandwidth and storage has exploded. At present, how to encode new-generation video, such as ultra-high-resolution and high-dynamic-range content, more effectively has become an active research topic worldwide.
Several video coding standards have emerged, such as H.264/Advanced Video Coding (AVC) and HEVC. However, the compression ratios that these standards can achieve cannot keep up with the rapid growth in demand for video data. The International Telecommunication Union (ITU) and the ISO/IEC Moving Picture Experts Group (MPEG) formed the JVET to develop a new video coding standard. In April 2018, the JVET officially named the new standard VVC [2].
Compared with HEVC, the compression ratio of VVC is greatly improved, but its encoding time is also many times that of HEVC. VVC uses a hybrid coding framework. Its picture partitioning has evolved from a single, fixed division to a diverse and flexible division structure, which adapts more efficiently to the encoding and decoding of high-resolution images. In particular, the QTMT division scheme [3], [4] is used in VVC to obtain better compression efficiency. The scheme has five division modes: Quad-Tree (QT), Binary-Tree-Vertical (BTV), Binary-Tree-Horizontal (BTH), Ternary-Tree-Vertical (TTV) and Ternary-Tree-Horizontal (TTH), which include two ternary-tree modes, TTV and TTH, that are not available in HEVC, as shown in Fig. 1. All five modes can be used, but QT splitting cannot be applied to sub-blocks produced by the other split modes. This scheme makes the resulting sub-blocks better suited to the texture distribution of the image, which greatly improves the accuracy of intra prediction and reduces the prediction residuals. Compared with HEVC [5], this change increases the coding efficiency by 8.5%. However, owing to the additional partition types, the computational complexity of RDO is much greater than that of HEVC, which makes the encoding time several times that of HEVC.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaoqing Pan.
An important direction of the current VVC video coding research is how to reduce the coding complexity and increase the coding speed without reducing the VVC coding efficiency or with little loss. Recently, the development of artificial neural networks has provided a new direction for the development of fast video coding.
In this paper, we design a DenseNet-based complexity reduction technique for the VTM10.0 intra encoder. We feed the CNN a 64 × 64 luminance Coding Unit (CU), and it predicts a vector representing the probability of a split edge on each 4 × 4 block boundary within the block. The encoder then uses this probability vector to skip low-probability splits.

II. RELATED WORKS
The VVC video coding standard has five different CU partitioning methods. Trying the different CU partitioning methods during encoding occupies most of the encoding time. Therefore, reducing the number of partitioning attempts can significantly reduce the encoding time.
In 2019, Tissier et al. studied the CU partition complexity of video coding in [6]. They showed that the computational complexity of block partitioning in VVC can be reduced to as little as 3% of the original by predicting the correct CU split, so a large amount of coding time can be saved.
At present, most block segmentation acceleration algorithms reduce the computational complexity and save coding time by terminating unnecessary RDO in advance [5]. Algorithms are mainly divided into two categories, traditional algorithms and applied machine learning methods. Traditional algorithms analyze the complexity of the texture by extracting features, such as the variance and mean square error of the image, and set a threshold to determine whether to terminate the RDO calculation early.
Reference [7] proposed a fast intra algorithm based on variance and the Sobel operator, in which the variance and gradient information of the CU are calculated to determine whether the current CU should be split. Reference [8] used the Canny operator to extract the edge information of the image and analyzed it to determine the most likely dividing direction of the current CU. A threshold is set on the ratio between the horizontal and vertical edge features. When the ratio is higher than the threshold, the horizontal division is tried. When the reciprocal of the ratio is higher than the threshold, the vertical division is tried. If neither condition is met, QT division is tried. In [9], an algorithm is proposed that calculates the rate distortion (RD) cost of the horizontal and vertical binary tree partitions. The cost is usually smaller in binary tree splitting and higher in MTT splitting. Reference [10] also uses image variance and gradient as features to accelerate segmentation. The method in [11] uses Bayesian probabilities as features to skip unnecessary splitting modes. Most traditional algorithms use only one or two features, which can filter out only a small amount of redundancy. Therefore, their acceleration of block segmentation is limited.
Due to the rapid development of machine learning and deep learning in the past two years, block segmentation acceleration algorithms have increasingly been combined with machine learning. In [12]-[15], trained Support Vector Machines (SVMs) are used to filter possible segmentation strategies. In [12], 6 different SVM classifiers are trained for blocks from 32 × 32 to 16 × 8 and 8 × 16 to handle CUs of different sizes. In [13], two SVM classifiers are trained to divide the results into three categories. Two of the categories are split and non-split, for which the error is small. The third category lies on the boundary between the other two, so a further calculation is needed to determine whether to split. In [14], an SVM is trained on 11 image features to determine whether the current CU needs to be split. Reference [15] trained multiple SVMs to predict the probabilities of the different partitioning methods separately and skipped the methods with lower probability. Reference [16] used a decision tree to predict the segmentation mode. Reference [17] used Bayesian classifiers to speed up segmentation. Random forest classifiers are also used for fast CU partitioning [18]-[20], exploiting their characteristics to reduce the risk of division errors.
Apart from neural networks, machine learning methods require the feature extraction to be designed manually. Neural networks can learn the required features through gradient descent without manual intervention. Given an appropriate training set, their learning speed and accuracy are generally higher than those of manually designed approaches. Recently, deep learning methods using neural networks have developed rapidly in the field of video coding [21]. In deep learning, the model best suited to processing image information is the CNN, which can learn the spatial information of an image. CUs have irregular shapes and sizes, which CNN-based methods must accommodate. References [22], [23] use an adaptive pooling layer to solve this problem. The adaptive pooling layer can compress feature maps of any size into a fixed size. References [24]-[26] directly input fixed-size blocks into the network to predict the range of the division depth and terminate the RD calculation early according to that range. Reference [27] converted the block structure into a hierarchical representation and directly predicted the division of the entire Coding Tree Unit (CTU). Reference [28] trained three CNN classifiers to handle CUs of different depths and sizes. Reference [29] divided the CU into sub-blocks of the same size and used a CNN to predict the probability that each sub-block boundary is a division boundary, in order to terminate some division modes early. In [30], the explicit VVC features (EVF) and the derived VVC features (DVF) that can be obtained during intra prediction are input into a lightweight neural network to determine which split modes to skip.
Because the VVC division method is more complicated, many methods that originally worked well on HEVC cannot be applied to VVC. We have designed a new CNN-based fast partitioning method for coding blocks. The method consists of two steps. First, we train a CNN to predict the probability that each edge of a 4 × 4 luminance block is a segmentation boundary. Then, the CU is split according to the predicted probabilities.

III. PROPOSED METHOD
A. STRUCTURE OF CNN
Most existing CNN-based algorithms only predict the next split of a single CU. Their common problem is that the network model must be called multiple times during the partitioning of a single CTU, which incurs a large time overhead. Moreover, the shapes of the CUs to be predicted vary greatly, which makes the network difficult to train.
Because a 128 × 128 CTU only allows QT division, the proposed algorithm divides a CTU into four 64 × 64 blocks, which are fed to the network as a batch of four. We use this batch to predict the probability that each boundary of every 4 × 4 block within each 64 × 64 CU is a division boundary. Then, we determine the division based on these probabilities and make full use of vectorization to reduce the time the network needs to run. Each CTU only needs to call the model once. The proposed network is based on DenseNet [31], and its main part is made up of DenseNet blocks. The feature maps output by the convolutional layers within a block all have the same size.
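As an illustration of the batching step, the following minimal sketch splits a 128 × 128 luma CTU into four 64 × 64 blocks. It is not taken from the authors' implementation: the function name `split_ctu`, the plain nested-list representation, and the raster ordering of the four blocks are all assumptions made for this example.

```python
def split_ctu(ctu):
    """Split a 128x128 luma CTU (a list of 128 rows) into four 64x64 blocks.

    The blocks are returned in raster order: top-left, top-right,
    bottom-left, bottom-right. This ordering is an assumption.
    """
    size, half = 128, 64
    if len(ctu) != size or any(len(row) != size for row in ctu):
        raise ValueError("expected a 128x128 CTU")
    batch = []
    for by in (0, half):          # top half, then bottom half
        for bx in (0, half):      # left half, then right half
            block = [row[bx:bx + half] for row in ctu[by:by + half]]
            batch.append(block)
    return batch


# Synthetic CTU whose sample value encodes its position (y * 128 + x).
ctu = [[y * 128 + x for x in range(128)] for y in range(128)]
batch = split_ctu(ctu)
```

In a real encoder integration the four blocks would be stacked into one 4 × 64 × 64 tensor so that a single forward pass covers the whole CTU.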
The structure of the DenseNet block is shown in Fig. 2. DenseNet extends the skip-connection idea of ResNet [32]. The large number of skip connections makes the propagation of features and gradients more effective and alleviates the vanishing-gradient problem that often occurs when a neural network is very deep. The input of each convolutional layer in the DenseNet block comes from the outputs of all preceding convolutional layers. These outputs are concatenated to form a feature map with more channels. A 1 × 1 convolution then adjusts the number of channels to 4 times the number of output channels, followed by a 3 × 3 convolutional layer and a ReLU activation, whose result is the output of the layer. We count one 1 × 1 convolution and one 3 × 3 convolution as one layer and use 6 layers in each DenseNet block. The feature maps output by the layers are concatenated to form the output of the block. Because attention mechanisms perform well in various computer vision tasks, we tried adding an ECA attention module after each DenseNet block. ECA is a lightweight channel attention mechanism, and its structure is shown in Fig. 3. However, in our experiments on this task, it did not significantly improve the network performance, so we did not use it in the final model.
In Fig. 4, the blue block is a two-dimensional convolution with a 3 × 3 kernel followed by a ReLU activation function. The orange block represents a DenseNet block, which contains 6 convolutional layers using 1 × 1 and 3 × 3 kernels. The layers shown in red and green are collectively called the transition layer, which is used to compress the amount of data. Red represents the average pooling layer. Green represents a convolutional layer with a 1 × 1 kernel and a ReLU activation function. Gray represents the global pooling layer, which transforms the feature map into a vector. Purple represents the fully connected layer, which outputs the final result. The hyperparameters of the network are shown in Table 1.
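The dense connectivity described above can be written compactly as follows. This is a sketch under stated assumptions: the symbol k for the number of output channels per layer and the notation x_ℓ for the output of layer ℓ are introduced here for illustration, and any normalization or intermediate activation that the actual network may contain is omitted because the text does not specify it.

```latex
% x_0 is the block input; [ . ] denotes channel-wise concatenation.
% Conv_{1x1} maps its input to 4k channels; Conv_{3x3} maps 4k channels to k channels.
x_\ell = \mathrm{ReLU}\Big( \mathrm{Conv}_{3\times 3}\big( \mathrm{Conv}_{1\times 1}\big( [\, x_0, x_1, \dots, x_{\ell-1} \,] \big) \big) \Big),
\qquad \ell = 1, \dots, 6
```

If the block input has c_0 channels, the input of layer ℓ under this formulation has c_0 + (ℓ − 1)k channels, which is why the 1 × 1 convolution is useful for keeping the cost of the 3 × 3 convolution bounded.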

B. DIVISIONAL JUDGMENT
According to the coordinates of the upper-left corner and the shape of the current block, we can obtain the vector values corresponding to the dividing lines of the five split modes in the block. The probability of BT division is determined by the average value of the corresponding vector elements. The probability of QT division is the average of the BT division probabilities in the two directions. The probability of TT division is the larger of the probabilities of the corresponding two dividing lines.
The probabilities of BTV, BTH, TTV and TTH are each calculated from the elements of the processed probability vector that lie on the corresponding dividing lines, where x and y are the coordinates of the upper-left corner of the current CU in the 64 × 64 block, h and w are the height and width of the current block, and P_{j+i} is the value of the element with index j + i in the processed probability vector. The threshold we use is dynamic. When the CU is large, more vector elements are involved in the calculation and a small amount of prediction error has less impact, so a higher threshold is used. When the CU is small, fewer vector elements are involved in the calculation and the risk is greater, so a lower threshold is used. The size of a CU is usually related to its depth: a deeper CU usually has a smaller size. For convenience of calculation, the threshold is set as a function of depth, where depth represents the depth of the current CU, which can be obtained from the currDepth property of the Partitioner class in the VTM. It is numerically equal to the sum of QtDepth and MtDepth. The process of embedding the method in the VTM software is shown in Fig. 6.
After the encoder divides each frame into CTUs, it performs CNN prediction on the CTU currently to be encoded. After entering the CU encoding stage, the encoder tries various encoding methods, including the 5 partitioning modes, for the current CU. When a segmentation mode is about to be tried, the probabilities of the segmentation modes are calculated according to the area attributes of the current CU. If the probability of the mode under consideration is greater than the threshold, the mode is tried; otherwise it is skipped. To reduce the coding time further, we also skip splits in the BT and TT modes when the probability difference between the horizontal and vertical directions is large.
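The rules stated above (BT as the mean along the dividing line, QT as the mean of the two BT values, TT as the maximum of its two lines) can be sketched as follows. The layout of the 480-element vector is an assumption made for this example: 240 vertical edge segments in row-major order followed by 240 horizontal ones, which is one layout consistent with 2 × 15 × 16 = 480 interior 4 × 4 edges. The helper names `v_idx`, `h_idx`, `vline`, `hline` and `split_probs` are hypothetical.

```python
GRID = 16                    # 4x4 sub-blocks per side of a 64x64 block
N_VERT = GRID * (GRID - 1)   # 240 vertical edge segments, then 240 horizontal

def v_idx(row, col):
    """Index of the vertical edge right of sub-block (row, col), col in 0..14."""
    return row * (GRID - 1) + col

def h_idx(row, col):
    """Index of the horizontal edge below sub-block (row, col), row in 0..14."""
    return N_VERT + row * GRID + col

def vline(p, y, h, x_line):
    """Mean edge probability along the vertical line x = x_line for rows y..y+h."""
    rows = range(y // 4, (y + h) // 4)
    return sum(p[v_idx(r, x_line // 4 - 1)] for r in rows) / len(rows)

def hline(p, x, w, y_line):
    """Mean edge probability along the horizontal line y = y_line for columns x..x+w."""
    cols = range(x // 4, (x + w) // 4)
    return sum(p[h_idx(y_line // 4 - 1, c)] for c in cols) / len(cols)

def split_probs(p, x, y, w, h):
    """Split-mode probabilities for the CU at (x, y) with size w x h.

    A mode whose dividing line would not fall on the 4-sample grid gets None.
    """
    btv = vline(p, y, h, x + w // 2) if w >= 8 else None
    bth = hline(p, x, w, y + h // 2) if h >= 8 else None
    ttv = (max(vline(p, y, h, x + w // 4), vline(p, y, h, x + 3 * w // 4))
           if w >= 16 else None)
    tth = (max(hline(p, x, w, y + h // 4), hline(p, x, w, y + 3 * h // 4))
           if h >= 16 else None)
    qt = (btv + bth) / 2 if btv is not None and bth is not None else None
    return {"QT": qt, "BTV": btv, "BTH": bth, "TTV": ttv, "TTH": tth}


# Example: only the vertical centre line (x = 32) is predicted as a boundary.
p = [0.0] * 480
for r in range(GRID):
    p[v_idx(r, 7)] = 1.0
probs = split_probs(p, 0, 0, 64, 64)
```

For this synthetic vector the BTV probability is high, BTH and both TT modes are zero, and QT sits halfway between the two BT values, which matches the averaging rule in the text.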
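The skip decision can be illustrated with a short sketch. The threshold formula is not reproduced here, so the function `threshold` below, with its `base`, `step` and `floor` parameters, is purely hypothetical and only mirrors the stated trend that deeper CUs use a lower threshold. The `max_gap` rule, which drops the less likely direction when the horizontal and vertical probabilities differ greatly, is likewise one possible reading of the text rather than the authors' exact rule.

```python
def threshold(depth, base=0.5, step=0.05, floor=0.1):
    """Hypothetical depth-dependent threshold: lower for deeper (smaller) CUs."""
    return max(floor, base - step * depth)

def modes_to_try(probs, depth, max_gap=0.5):
    """Return the set of split modes the encoder should still evaluate.

    probs maps mode names to probabilities (or None when a mode is unavailable).
    """
    t = threshold(depth)
    keep = {m for m, pr in probs.items() if pr is not None and pr > t}
    # Assumed extra rule: when the two directions differ strongly,
    # drop the less likely direction of that split type.
    for h_mode, v_mode in (("BTH", "BTV"), ("TTH", "TTV")):
        ph, pv = probs.get(h_mode), probs.get(v_mode)
        if ph is None or pv is None:
            continue
        if abs(ph - pv) > max_gap:
            keep.discard(h_mode if ph < pv else v_mode)
    return keep


example = {"QT": 0.5, "BTV": 0.9, "BTH": 0.3, "TTV": 0.45, "TTH": 0.05}
selected = modes_to_try(example, depth=2)
```

Non-split coding modes are unaffected by this filter; only the RDO checks for the rejected split modes are skipped.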

C. TRAINING
Because each value to be predicted lies in [0, 1], we use the cross-entropy loss function, where m is the number of samples in a batch, n is the number of elements in each sample, p(x_{i,j}) is the true value of the j-th element of the i-th sample, and q(x_{i,j}) is the corresponding predicted value. The Adam optimizer [34] is used to perform gradient descent on the CNN. Training uses the PyTorch 1.7.0 framework in a Python 3.7 environment, and the learning rate is adjusted with the cosine annealing strategy (CosineAnnealingWarmRestarts). We trained for 20 epochs on a GTX 1650 GPU with a batch size of 16.
FIGURE 6. Process of skipping CU split. CNN means using a trained CNN model for prediction. Pqt, Pbth, etc. represent the probability of the corresponding split mode. T is the preset threshold. Other modes indicate prediction modes other than CU split, such as intra prediction, inter prediction, and so on. After executing the Split operation, the process enters the sub-CU and repeats all operations after CNN (not including CNN inference).
To build the input data set of 64 × 64 luminance blocks and their corresponding label vectors, 100 images are extracted at equal intervals from the 800 HR samples of the Div2k [35] dataset, which is used to train super-resolution networks, and these images are divided into 64 × 64 blocks. The luminance signal of each 64 × 64 block is obtained from the VTM when the picture is encoded. The data set consists of still images because the proposed solution is mainly intended for the ''All Intra'' (AI) configuration, and still images offer better diversity than a video sequence dataset. To test the generalization ability of the network and its effect in real coding, we did not use the common test conditions (CTC) [36] sequences for training. The input data set is coded by the VTM 10.0 software under the ''All Intra'' configuration to establish the corresponding labels. QTMT partition information is collected for each 64 × 64 CU and converted to the output format of the CNN. The label is a one-dimensional vector of 480 elements, each being 1 (splitting boundary) or 0 (not a splitting boundary). The partition information comes from the bitstream analysis tool (DecoderAnalyserApp) in VTM.
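For illustration, one common cross-entropy form that is consistent with the symbols m, n, p and q defined above is the binary cross-entropy averaged over all elements. The sketch below assumes that form and that normalization (division by m · n); the exact expression used in training may differ, and the function name `cross_entropy` and the clamping constant `eps` are choices made for this example.

```python
import math

def cross_entropy(p, q, eps=1e-7):
    """Mean binary cross-entropy between labels p and predictions q.

    p and q are lists of m vectors with n elements each.
    Predictions are clamped to [eps, 1 - eps] to keep the logarithm finite.
    """
    m, n = len(p), len(p[0])
    total = 0.0
    for p_i, q_i in zip(p, q):
        for p_ij, q_ij in zip(p_i, q_i):
            q_ij = min(max(q_ij, eps), 1.0 - eps)
            total -= p_ij * math.log(q_ij) + (1.0 - p_ij) * math.log(1.0 - q_ij)
    return total / (m * n)


# An uninformative prediction of 0.5 everywhere gives a loss of ln(2).
loss = cross_entropy([[1.0, 0.0]], [[0.5, 0.5]])
```

In PyTorch the same quantity would typically be obtained from a built-in binary cross-entropy loss applied to sigmoid outputs rather than from an explicit loop.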
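The conversion from partition information to a 480-element label can be sketched as follows. The input format (a list of `(x, y, w, h)` rectangles that tile the 64 × 64 block) and the index layout (240 vertical edge segments followed by 240 horizontal ones) are assumptions for this example; the actual output of DecoderAnalyserApp and the authors' element ordering are not specified in the text.

```python
GRID = 16                    # 4x4 sub-blocks per side of a 64x64 block
N_VERT = GRID * (GRID - 1)   # 240 vertical edges come first in the vector

def make_label(cus):
    """Build the 0/1 boundary label from CUs given as (x, y, w, h) tuples.

    Only right and bottom CU borders are marked: in a complete tiling,
    every interior left/top border is some neighbour's right/bottom border.
    """
    label = [0] * (2 * N_VERT)
    for x, y, w, h in cus:
        if x + w < 64:                                   # interior right border
            col = (x + w) // 4 - 1
            for r in range(y // 4, (y + h) // 4):
                label[r * (GRID - 1) + col] = 1
        if y + h < 64:                                   # interior bottom border
            row = (y + h) // 4 - 1
            for c in range(x // 4, (x + w) // 4):
                label[N_VERT + row * GRID + c] = 1
    return label


# A 64x64 block split once by QT into four 32x32 CUs.
qt_cus = [(0, 0, 32, 32), (32, 0, 32, 32), (0, 32, 32, 32), (32, 32, 32, 32)]
qt_label = make_label(qt_cus)
```

A single QT split marks 16 vertical and 16 horizontal segments, so 32 of the 480 elements are set to 1.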

IV. EXPERIMENTAL RESULTS
This section introduces the experimental setup in detail and compares our results with several advanced technologies. The coding quality is measured by BDBR, and the complexity reduction is measured by the coding time saving rate (ΔT), which is determined as:
$$\Delta T = \frac{1}{4} \sum_{QP_i \in \{22, 27, 32, 37\}} \frac{T_{OC}(QP_i) - T_{SC}(QP_i)}{T_{OC}(QP_i)} \times 100\%$$
where T_OC is the reference coding time of the VTM10.0 anchor and T_SC is the coding time of our algorithm. We measured the inference time of the trained network during the test phase. When the network runs on the CPU with an input of size 4 × 64 × 64, the inference time is about 0.18 seconds. Because performance differs across platforms, ΔT does not include the time spent by the neural network.
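As a small worked example, the time saving rate can be computed as the relative difference between the anchor time and the time of the accelerated encoder, averaged over the tested QPs. The sketch assumes the four CTC QP values 22, 27, 32 and 37, and the timings in the example are made up for illustration only; they are not measurements from the paper.

```python
def time_saving(t_oc, t_sc):
    """Relative time saving in percent for one encode.

    t_oc: coding time of the anchor encoder, t_sc: coding time of the fast encoder.
    """
    return (t_oc - t_sc) / t_oc * 100.0

def mean_time_saving(times):
    """Average the per-QP saving; times maps QP -> (t_oc, t_sc)."""
    return sum(time_saving(a, b) for a, b in times.values()) / len(times)


# Hypothetical timings in seconds for one sequence.
timings = {22: (1000.0, 600.0), 27: (800.0, 400.0),
           32: (600.0, 300.0), 37: (400.0, 240.0)}
delta_t = mean_time_saving(timings)
```

Averaging the per-QP ratios, rather than dividing summed times, keeps the slow low-QP encodes from dominating the result.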

B. RESULTS AND ANALYSIS
Because VTM 4.0, VTM 5.0 and VTM 10.0 use the same CU splitting scheme, our comparison with [11] and [19] is reasonable. We conducted experiments on the role of the ECA module in the proposed algorithm, as shown in Table 2. ΔT in Table 2 does not include the CNN inference time. According to Table 2, whether or not the ECA module is used has no obvious impact on the final performance. In terms of BDBR, the scheme using the ECA module reduces the average loss by only 0.02%. In most high-resolution sequence tests, the BDBR loss of the scheme using the ECA module is higher than that of the scheme without it. In the coding time comparison that excludes the network inference time, the scheme using the ECA module shows no obvious advantage. Although ECA is a lightweight attention module, its global average pooling operation still takes considerable time. In our experiments, the ECA module increased the CNN inference time by about 10% when running on the CPU. In summary, the ECA module is not suitable for this solution.
Our algorithm is applied only to 128 × 128 CTUs. It saves less time on low-resolution sequences, because the CTUs smaller than 128 × 128 at the picture boundary occupy a larger proportion of a low-resolution image than of a high-resolution image, and because our training set includes only images with 2K resolution. On the high-resolution A1 and A2 sequences, however, the proposed algorithm saves 74.28% of the coding time on average and brings only a 1.925% BDBR loss. For the low-resolution classes C and D, the proposed method reduces the complexity by 27.39% and increases the BDBR by 1.01%. The training database consists of 2K images, so the proposed CNN performs better on high-resolution sequences. At the same time, since the training set does not contain pictures with text content, there is a higher BDBR loss for the class E sequences whose main content is text. Table 3 compares our method with other methods. The data for the other algorithms are calculated from the data given in their papers. Because different papers use different sequences, we only make comparisons between classes. Compared with [11], our method saves more coding time in most cases. On the higher-resolution A1 and A2 sequences, even when the network inference time is added, the time saved by our scheme is still much higher than that of [11], and the BDBR loss gap is within an acceptable range. When the network inference time is ignored, our method outperforms [19] on high-resolution sequences. When the inference time is added, our method is better than [19] on class A2, while on class A1 the time saving is very close to that of [19] and the BDBR loss is significantly better. Our algorithm saves less time at lower resolutions because we only predict blocks of size 128 × 128. In most of the sequences, we obtain a time saving rate much higher than that of [30] at an acceptable BDBR loss.
The method is divided into two parts: prediction and decision-making. The running time of the decision-making part is extremely short and can be ignored. The time for the prediction part using the neural network is shown in Fig. 7. The running time of the prediction part occupies an average of 6.5% of the original VTM encoding time. This ratio can be further reduced by means of model quantization or GPU acceleration.

V. CONCLUSION
This paper proposes a CU split acceleration scheme based on the DenseNet network. A CNN is used to analyze the texture in each group of four 64 × 64 coding blocks and to predict the probability that each 4 × 4 block boundary in these blocks is a partition boundary. From the boundary probabilities, the split probabilities are derived and compared with the preset threshold. Compared with the original encoding time of VTM, the execution time of the CNN is short. In the ''All Intra'' configuration under VTM 10.0, the proposed solution reduces the complexity by 46.10%, while the BDBR increases slightly by 1.86%. For high-resolution sequences, the acceleration is higher, up to 64.81%, at the cost of a 1.92% BDBR overhead. These results demonstrate the effectiveness of the proposed method and motivate us to conduct further research and analysis. Since the training set is small and the network has not been fully optimized, the proposed solution still has considerable room for improvement. We will try to use a larger data set for training and further optimize the structure of the network. At the same time, we will also try to improve the performance of our method on low-resolution sequences.
RIJIAN SU received the Ph.D. degree in control science and engineering from Huazhong University of Science and Technology, Wuhan, China, in 2010. He is currently a Professor with Zhengzhou University of Light Industry, China. His current research interests include image processing, Bayesian estimation, and intelligent computation.
VOLUME 9, 2021