Mode Information Guided CNN for Quality Enhancement of Screen Content Coding

Video quality enhancement methods are of great significance in reducing the artifacts of videos decoded with High Efficiency Video Coding (HEVC). However, existing methods mainly focus on improving the quality of natural sequences rather than screen content sequences, which have drawn more attention than ever due to the demands of remote desktops and online meetings. Unlike natural sequences encoded by HEVC, screen content sequences are encoded by Screen Content Coding (SCC), an extension of HEVC. Therefore, we propose a Mode Information guided CNN (MICNN) to further improve the quality of screen content sequences at the decoder side. To exploit the characteristics of screen content sequences, we extract the mode information from the bitstream as an input of MICNN. Furthermore, given the limited number of publicly available screen content sequences, we establish a large-scale dataset to train and validate our MICNN. Experimental results show that the proposed MICNN achieves a 3.41% BD-rate saving on average. In addition, MICNN consumes an acceptable amount of computational time compared with other video quality enhancement methods.


I. INTRODUCTION
With the rapid development of intelligent terminal technology, mobile devices such as smartphones and tablets have made Screen Content (SC) video more and more widespread. Desktop collaboration, screen sharing, cloud gaming, etc., greatly increase the scope of video applications. Especially due to the recent spread of COVID-19, the demand for online education and virtual conferences is rapidly increasing, with Screen Content Coding (SCC) [1] playing a critical role. Unlike a natural video sequence, which is captured by a camera as in the example of Fig. 1(a), a screen content sequence such as Fig. 1(b) can be generated directly by different mobile terminals. It is composed of many static or moving computer-generated images and texts, and often contains many uniform and flat areas, repeated patterns, limited pixel colors, etc. By making use of these screen content characteristics, SCC [1] was proposed as an extension of High Efficiency Video Coding (HEVC) [2] to increase coding efficiency. In addition to the conventional HEVC intra (INTRA) mode [3], the SCC standard adopts two dedicated coding modes, Intra Block Copy (IBC) and palette (PLT) [4], [5], [6]. IBC [4] uses a reconstructed block of the current frame as the prediction block: it performs motion compensation for the current Coding Unit (CU) within the reconstructed region of the current frame, which improves the compression efficiency of screen content video by more than 30% [5]. PLT enumerates the color values in each coding block to generate a color table and transmits an index for each sample to indicate which color in the table it belongs to. With PLT, compression efficiency is further improved by 15% over coding with IBC alone [5].
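The palette idea can be illustrated with a small NumPy sketch. The helper names `palette_encode`/`palette_decode` and the 8-color limit are illustrative assumptions, not SCC reference code; the real PLT mode additionally supports escape pixels and predicts the color table from previously coded blocks.

```python
import numpy as np

def palette_encode(block, max_colors=8):
    """Toy palette (PLT) coding: build a color table from the distinct
    values in a block and replace each sample by its table index."""
    colors, indices = np.unique(block, return_inverse=True)
    assert len(colors) <= max_colors, "too many distinct colors for PLT"
    return colors, indices.reshape(block.shape)

def palette_decode(colors, index_map):
    """Reconstruct the block by looking each index up in the color table."""
    return colors[index_map]
```

A block with few distinct colors, typical of text and graphics, round-trips losslessly through the table-plus-index representation.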
Although the coding efficiency can be improved by introducing these coding tools, screen content videos still contain compression artifacts corresponding to the dedicated tools in the SCC standard. An HEVC codec utilizes a deblocking filter (DF) and a sample adaptive offset (SAO) to eliminate blocking and ringing artifacts, thereby enhancing the quality of the reconstructed frames. In recent years, deep learning has made new progress in this field and achieved impressive performance in video enhancement. A series of neural network architectures have been proposed to remove the artifacts in reconstructed videos. Examples include an In-loop Filter using Convolutional Neural Networks (IFCNN) [7], a Variable-filter-size Residue-learning CNN (VRCNN) [8], a Deep CNN-based Auto Decoder (DCAD) [9], a Multi-layered Deep CNN (MDCNN) [10], and a Decoder-side Scalable CNN (DS-CNN) [11]. Unlike the other architectures, which replace the in-loop filter, DCAD and DS-CNN were designed to improve video quality at the decoder side. The advantage of these post-processing methods is that there is no need to modify the HEVC codec internally. Hence, the structure proposed in this paper focuses on video post-processing at the decoder side, as depicted in Fig. 2.
In addition to the development of network structures, the rich side information in the video bitstream can also help to guide the enhancement of decoded videos. For example, it was found in [12] that the partition tree in the coding process indicates the corresponding compression loss of the decoded video. To utilize this side information in the HEVC codec, the work in [12] subsequently proposed a double-input network that takes a partition mask, generated from the HEVC partition tree, into account as side information. With the use of the partition mask, the blocking effect is eliminated. However, this approach is designed for natural videos and still ignores the characteristics of screen content video. In other words, its utilization of side information is not closely related to screen content characteristics.
In summary, the novelty and contributions of our work are twofold:
• We propose a novel post-processing network for enhancing decoded screen content videos based on the coding mode information embedded in the coded bitstream. Three binary mode masks derived from the dedicated coding tools in SCC are fused with the corresponding decoded frame.
• We establish a large-scale dataset containing 9810 frames of screen content videos. This dataset will be made publicly available to facilitate further research.

The remainder of the paper is organized as follows. Related works are reviewed in Section II. Section III describes the generation of the proposed mode masks and the details of the proposed network architecture. The proposed dataset is presented in Section IV, and experiments and ablation studies are reported in Section V. Section VI concludes this paper.

II. RELATED WORKS

A. DEEP LEARNING-BASED VIDEO QUALITY ENHANCEMENT
In recent years, deep learning has been successfully applied to computer vision tasks, and many works have applied it to improve the visual quality of HEVC videos. These works can be divided into two major approaches. One is to modify an internal module of the codec, such as the in-loop filter, for visual quality enhancement [8]. The other uses post-processing techniques to improve the video quality after the decoder [9], [11]. For the former, the HEVC standard specifies an in-loop filter [2], which comprises a deblocking filter and a sample adaptive offset (SAO). The in-loop filter is embedded in the encoding and decoding loops, after inverse quantization and before the decoded picture is saved in the decoded picture buffer, to improve image quality. The work in [7] suggested a new in-loop filtering technology for HEVC using convolutional neural networks (CNN), namely IFCNN, to replace the SAO filter for coding efficiency and subjective visual quality improvement. Inspired by the deep convolutional network for compression artifacts reduction (ARCNN), Dai et al. [8] proposed a Variable-filter-size Residue-learning CNN (VRCNN), which can improve the visual quality of HEVC videos without increasing the bit rate compared to the original in-loop filter in HEVC. However, these networks cannot be directly applied to compressed videos, as they were designed as part of the HEVC encoder. For the latter approach, a Deep CNN-based Auto Decoder (DCAD) [9] was developed to improve visual quality through deep learning only at the decoder side. Later, a Decoder-side Scalable CNN (DS-CNN) was proposed by Yang et al. [11], wherein two subnetworks, DSCNN-I and DSCNN-B, reduce the artifacts of intra-coded and inter-coded frames, respectively. In [13], QE-CNN-I and QE-CNN-P were also proposed to enhance intra-coded and inter-coded frames, respectively. In [14], Huang proposed a cross feature fusion framework to enhance gaming videos at the decoder side.
Notably, these works only focus on the decoded frame as the input of the network. They do not consider the information extracted from the bitstream.

B. DUAL-INPUT CNN ON VIDEO QUALITY ENHANCEMENT
Recently, dual-input networks have been proposed for visual quality enhancement of natural videos. As shown in Fig. 3, the beginning of a dual-input network is composed of two branches: the main branch and the mask branch. The mask branch utilizes compressed information extracted from the bitstream as side information for the neural network. The main branch and the mask branch in Fig. 3 are fused at a certain position within the network, and the post-processed frame with reduced artifacts is finally obtained. For instance, a partition-aware convolutional neural network was proposed in [15], which uses the partition information produced by the encoder to assist post-processing at the decoder side. In particular, it adopts a boundary mask and a mean mask to guide the neural network. In Fig. 4(a), the boundary mask represents the CU partition information by setting the CU boundary regions to 255 and the non-boundary regions to 0. The mean mask, as shown in Fig. 4(b), represents the CU partition information by filling each CU with the mean of all pixel values within that CU. Either of these two masks can be input into the model in Fig. 3 as a grayscale image. Inspired by He et al. [12], another dual-input model proposed by Hoang and Zhou [16], a Deep Recursive Residual Network with Block information (B-DRRN), also employs the mean mask as side information. However, these dual-input networks only focus on natural videos and do not consider the specific features of screen content. In contrast, this paper proposes a novel multi-input CNN that takes the decoded frames together with the mode information of SCC as input. The idea is to utilize three binary masks, encoding the information of the IBC, PLT, and INTRA modes, to further enhance the quality of screen content videos.
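The two partition masks of [15] can be sketched as follows. The function names and the `(y, x, h, w)` rectangle format for leaf CUs are illustrative assumptions; a real implementation would parse the rectangles from the HEVC partition tree.

```python
import numpy as np

def boundary_mask(shape, cu_rects):
    """Boundary mask: 255 on CU boundary pixels, 0 elsewhere.
    `cu_rects` is a list of (y, x, h, w) leaf-CU rectangles that
    are assumed to tile the frame."""
    mask = np.zeros(shape, dtype=np.uint8)
    for y, x, h, w in cu_rects:
        mask[y, x:x + w] = 255          # top edge
        mask[y + h - 1, x:x + w] = 255  # bottom edge
        mask[y:y + h, x] = 255          # left edge
        mask[y:y + h, x + w - 1] = 255  # right edge
    return mask

def mean_mask(frame, cu_rects):
    """Mean mask: each CU filled with the mean of its decoded pixels."""
    mask = np.zeros_like(frame, dtype=np.float32)
    for y, x, h, w in cu_rects:
        mask[y:y + h, x:x + w] = frame[y:y + h, x:x + w].mean()
    return mask
```

Either mask can then be fed to the mask branch as a single-channel image alongside the decoded frame.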

III. PROPOSED MULTI-INPUT CNN FOR VIDEO ENHANCEMENT
In this section, we describe our network architecture in detail. The framework of the proposed MICNN is shown in Fig. 2. To exploit the side information from the bitstream, we propose three binary masks dedicated to screen content videos. To the best of our knowledge, this is the first work to enhance SC quality using deep learning with the help of mode information input into the deep network as binary masks.

A. MOTIVATION
Owing to the block partitioning process and quantization in HEVC, the artifacts of a decoded video correspond highly to the CU information. Hence, the CU information contains important clues that can be used to eliminate the artifacts of decoded videos. Recently, the works in [12] and [15] have proven that using the mean mask or the boundary mask can achieve better performance in post-processing methods.
However, screen content videos have different characteristics from natural videos: they often contain many uniform and flat areas, repeated patterns, and limited pixel colors. CU information cannot represent these characteristics. Therefore, different mechanisms of video quality enhancement are required for these different types of content. To identify natural content and screen content so that our MICNN can effectively enhance the reconstruction quality of both, the network can be guided by the coding mode. Fig. 5 explains the relationship between content type and coding mode, with Fig. 5(a) showing a frame with mixed content. IBC and PLT are designed for screen content: (1) IBC can find an almost exact match for certain CUs within the same frame due to the massive presence of texts and computer-generated graphics, and (2) PLT can well handle CUs with only a few distinct colors. Therefore, the coding modes embedded in the coded bitstream are good candidates for identifying CU content types and can be used to guide video quality enhancement for screen content videos. In the following section, we propose to use three binary mode masks derived from the different coding modes, IBC, INTRA, and PLT, in our new MICNN to improve the visual quality of screen content. Through the input of mode information, MICNN can eliminate the different artifacts of decoded screen content videos according to the content encoded by each coding mode.
Each binary mode mask marks the pixels whose CU is encoded with the corresponding coding mode. For a pixel at position $(x, y)$, the three masks are assigned as

$$M_{IBC}(x, y) = \begin{cases} 1, & \text{if the CU containing } (x, y) \text{ is coded in IBC mode} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

$$M_{PLT}(x, y) = \begin{cases} 1, & \text{if the CU containing } (x, y) \text{ is coded in PLT mode} \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

$$M_{INTRA}(x, y) = \begin{cases} 1, & \text{if the CU containing } (x, y) \text{ is coded in INTRA mode} \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

Figure 6 shows examples of the IBC binary mode mask, PLT binary mode mask, and INTRA binary mode mask based on the values assigned by (1)-(3).
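Generating the three binary mode masks from decoded CU information can be sketched as follows. The `mode_masks` helper and the `((y, x, h, w), mode)` pair format are illustrative assumptions; in practice the per-CU modes are parsed from the SCC bitstream during decoding.

```python
import numpy as np

# Hypothetical per-CU mode labels.
IBC, PLT, INTRA = 0, 1, 2

def mode_masks(shape, cu_modes):
    """Build the three binary mode masks. `cu_modes` is a list of
    ((y, x, h, w), mode) pairs assumed to cover the frame; each
    mask is 1 inside CUs coded with its mode and 0 elsewhere."""
    masks = {m: np.zeros(shape, dtype=np.float32) for m in (IBC, PLT, INTRA)}
    for (y, x, h, w), mode in cu_modes:
        masks[mode][y:y + h, x:x + w] = 1.0
    return masks[IBC], masks[PLT], masks[INTRA]
```

Since every pixel belongs to exactly one coded CU, the three masks are mutually exclusive and sum to one everywhere.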
The baseline CNN architecture is shown in Fig. 7(a), and our proposed mode information guided CNN (MICNN) is built upon it. The MICNN architecture consists of three components: feature extraction, feature fusion, and reconstruction. In the feature extraction stage, one main branch and three sub-branches are used to extract features. The decoded frame is fed into the CNN through the main branch, while the binary mode masks M_IBC, M_PLT, and M_INTRA are the inputs of the three sub-branches.
The binary mode masks are the side information: they are fed into the neural network and combined with the decoded frame. Therefore, the order in which the three binary mode masks are fused in the neural network needs to be considered, and an ablation study on the various orders will be presented later. Fig. 7(b) shows the details of our proposed fusion method: the features extracted from the different binary mode masks are added to the features extracted from the decoded frame in order.
Moreover, Residual Dense Blocks (RDBs), shown in Fig. 7(c), are stacked as the main branch of the proposed MICNN. As shown in Fig. 7(c), the RDB contains three groups of convolutional layers in dense connection [17]. Each group consists of two 3 × 3 convolutional layers with two ReLU activation functions. Meanwhile, the residual connection in each RDB is employed to reduce the gradient vanishing problem and help backpropagation. Compared with the original residual block shown in Fig. 7(d), the RDB uses dense connections, which can exploit hierarchical features.
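The dense-plus-residual wiring of the RDB can be sketched without a deep learning framework by standing in a per-channel linear map for each 3 × 3 convolution. The `conv_relu`/`rdb` helpers and the weight layout are illustrative assumptions; only the wiring (each group sees the concatenation of the block input and all previous group outputs, plus a local residual skip) follows the description of Fig. 7(c).

```python
import numpy as np

def conv_relu(x, w):
    """Stand-in for a conv + ReLU layer: a linear map over the
    channel axis followed by a rectifier."""
    return np.maximum(x @ w, 0.0)

def rdb(x, weights):
    """Residual Dense Block wiring: each of the three groups takes
    the concatenation of the block input and all previous group
    outputs; a final fusion projects back to the input width and a
    residual connection adds x."""
    feats = [x]
    for w1, w2 in weights["groups"]:
        h = conv_relu(np.concatenate(feats, axis=-1), w1)
        h = conv_relu(h, w2)
        feats.append(h)
    fused = np.concatenate(feats, axis=-1) @ weights["fuse"]
    return fused + x  # local residual connection
```

The residual skip keeps gradients flowing through the block, while the growing concatenation is what lets later groups reuse earlier (hierarchical) features.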
To formulate the MICNN model proposed in Fig. 7(b), it is assumed that the decoded and enhanced frames are represented by $\tilde{D}$ and $\tilde{Y}$, respectively. The composite non-linear mapping of a convolutional operation and an activation function (ReLU) is denoted as $H_{cr}(\cdot)$, and the RDB is denoted as $H_{RDB}(\cdot)$. The output of the main branch in the feature extraction stage can then be obtained by

$$\tilde{y} = H_{RDB}(H_{cr}(\tilde{D})) \quad (4)$$

The outputs of the sub-branches in the feature extraction stage can be formulated as

$$\tilde{m}_{ibc} = H_{cr}(M_{IBC}) \quad (5)$$
$$\tilde{m}_{intra} = H_{cr}(M_{INTRA}) \quad (6)$$
$$\tilde{m}_{plt} = H_{cr}(M_{PLT}) \quad (7)$$

where $\tilde{m}_{ibc}$, $\tilde{m}_{intra}$, and $\tilde{m}_{plt}$ are defined as the feature maps of the IBC mode mask, INTRA mode mask, and PLT mode mask, respectively. These feature maps are then integrated into the main branch in the feature fusion stage, which can be formulated as

$$\tilde{y}_{ibc} = \tilde{y} + \tilde{m}_{ibc} \quad (8)$$
$$\tilde{y}_{intra} = \tilde{y}_{ibc} + \tilde{m}_{intra} \quad (9)$$
$$\tilde{y}_{plt} = \tilde{y}_{intra} + \tilde{m}_{plt} \quad (10)$$

where $\tilde{y}_{ibc}$, $\tilde{y}_{intra}$, and $\tilde{y}_{plt}$ denote the outputs after adding the IBC mode mask, the INTRA mode mask, and the PLT mode mask features in order, respectively. Finally, the enhanced frame can be generated as

$$\tilde{Y} = H_{c}(H_{RDB}(\tilde{y}_{plt})) + \tilde{D} \quad (11)$$

where $H_c(\cdot)$ denotes the convolutional operation. The proposed network is trained in an end-to-end manner. To optimize our model, we apply the Mean Squared Error (MSE) as the loss function. Given a training set $\{(\tilde{D}_i, \mathcal{M}_i, Y_i)\}_{i=1}^{N}$, where $N$ is the number of patches in the training set, $Y_i$ is the ground-truth patch of the decoded patch $\tilde{D}_i$, and $\mathcal{M}_i$ is the set of mode mask patches, the loss function can be formulated as

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \left\| H(\tilde{D}_i, \mathcal{M}_i; \theta) - Y_i \right\|^2 \quad (12)$$

where $H(\cdot)$ denotes our proposed network and $\theta$ denotes all its parameters.
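At tensor level, the fusion stage and the training loss can be sketched as follows. The helper names are hypothetical; the sequential additions mirror the in-order fusion of the mask features described above, shown here in the ibc-intra-plt order used in the paper's experiments.

```python
import numpy as np

def fuse_in_order(y_main, m_ibc, m_intra, m_plt):
    """Feature fusion stage: mask features are added to the
    main-branch features one at a time, in order."""
    return ((y_main + m_ibc) + m_intra) + m_plt

def mse_loss(enhanced, ground_truth):
    """MSE loss averaged over a batch of patches."""
    diff = np.asarray(enhanced, dtype=np.float64) - np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(diff ** 2))
```

Because the fusion is element-wise addition, the mask feature maps must share the spatial size and channel count of the main-branch features.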

IV. PROPOSED POLYUSCC DATASET
The work of this paper mainly focuses on video quality enhancement of SC sequences. However, the number of available SC sequences is limited. To avoid overlap with the sequences provided in the Common Test Condition (CTC) [19], SC sequences were gathered from other sources [18], [20] or captured by ourselves [21]; in total, 34 sequences are adopted, as shown in Table 1. These sequences can be divided into three types: text and graphics with motion (TGM), animation (A), and mixed (M) content. The mixed content contains both natural content and screen content, the text and graphics with motion (TGM) type consists of text, graphics, and animation, and the animation (A) type contains only gaming content. To make the database focus on the different types of screen content, the number of TGM sequences is twice that of the mixed content sequences. The dataset consists of three parts. First, to guarantee data reliability and availability, 15 sequences are provided by the JCT-VC [18] but not included in the CTC [19]. Second, there are 5 SC sequences from Tsang et al. [20]. Third, to enrich the text and graphics with motion content and mixed content, we further captured 14 video sequences ourselves. Some examples of our self-captured videos are presented in Fig. 8, and these sequences will be published on our website [21]. During the evaluation of the proposed MICNN, 27 sequences are used for training and the remaining 7 sequences are used for validation, as shown in Table 1. We randomly select one patch from one frame for each iteration, and to guarantee the robustness of our dataset, all frames are used in the training process. In our experiments, the learning rate was set to 0.0001 for QP 37 and fine-tuned as the QP decreases. The adaptive moment estimation (Adam) optimization method was used to train the model for 500 epochs. A computer equipped with the Windows 10 operating system, an Intel i9-10900K CPU, 64 GB RAM, and NVIDIA 3090 Ti GPUs was used for model training.
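The per-iteration patch sampling can be sketched as follows. The `sample_patch` helper and the 64 × 64 patch size are illustrative assumptions (the paper does not state the patch size); the key point is that the decoded frame and its mode masks are cropped at the same random position so they stay spatially aligned.

```python
import numpy as np

def sample_patch(frame, masks, patch=64, rng=None):
    """Randomly crop one training patch per iteration from a decoded
    frame and its spatially aligned mode masks."""
    rng = rng if rng is not None else np.random.default_rng()
    H, W = frame.shape[:2]
    y = rng.integers(0, H - patch + 1)
    x = rng.integers(0, W - patch + 1)
    crop = lambda a: a[y:y + patch, x:x + patch]
    return crop(frame), [crop(m) for m in masks]
```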

V. EXPERIMENTAL RESULTS

A. EXPERIMENTAL SETTING
The test set contains 12 video sequences provided in the CTC [19], none of which appears in the training or validation sets. This is essential to avoid overfitting issues.

B. ABLATION STUDY
As mentioned in Section III, the order in which the three binary mode masks are fused in our proposed MICNN affects performance. An ablation study was conducted to decide the order of the three binary mode masks and to verify the necessity and generalization ability of the proposed masks. Various MICNN architectures were compared to find the optimal order of inputting the binary mode masks, covering all possible combinations listed in Table 2: ibc-plt-intra, plt-ibc-intra, intra-plt-ibc, ibc-intra-plt, intra-ibc-plt, and plt-intra-ibc. These notations represent the different orders of the binary mode masks by name; for example, ibc-plt-intra means the IBC mode mask is used first, then the PLT mode mask is added, and finally the INTRA mode mask. Furthermore, to verify the superiority of our mode masks, we input the mean mask proposed in [15] into the baseline model in Fig. 7(a) with the same number of layers and the same training process. The PSNR improvement of the various combinations on the validation set under the AI configuration is shown in Table 2. It can be seen that ibc-intra-plt achieves the highest PSNR improvement (0.629 dB) over the SCC baseline at QP = 37, so we use the ibc-intra-plt order when comparing with other enhancement algorithms in the following discussions. To further verify the efficiency of our proposed fusion approach in Fig. 9(a), we also evaluated two different fusion strategies, Early Fusion by Concatenation (EFC) and Late Fusion by Concatenation (LFC), as shown in Fig. 9(b) and Fig. 9(c), respectively. In EFC, we concatenate the decoded frame and the binary mode masks as the input; the main branch of EFC is the same as in our proposed MICNN. The sub-branches of LFC are also the same as in our proposed MICNN, but LFC concatenates all feature maps of the decoded frame and binary mode masks before the feature reconstruction stage. The PSNR improvements for the various fusion strategies are shown in Table 3.
It can be seen that our proposed fusion strategy achieves the highest PSNR improvement, as it makes better use of the mode information. To further verify the contribution of each proposed mode mask, we remove the INTRA mode mask, the IBC mode mask, and the PLT mode mask, respectively, in Table 4. The results show that the best performance is achieved when all three mode masks are adopted.

To verify the feature extraction power of the RDB, we replaced the RDB with the traditional Residual Block shown in Fig. 7(d) and the traditional Dense Block [17] for comparison. The results are shown in Table 5: the RDB achieves the highest PSNR performance. Combining the residual connection and dense connections helps to extract features and keep the high-frequency details. The reason is that the residual connection can prevent gradient vanishing, while the dense connections can reuse the features from previous layers.

C. OVERALL PERFORMANCE

1) OBJECTIVE VISUAL QUALITY ASSESSMENT
In this section, we compare QECNN [13], DCAD [9], Partition-aware CNN [15], and QECF [14] with our proposed MICNN. Table 6 and Table 7 show the average PSNR improvement (ΔPSNR) and the average SSIM improvement (ΔSSIM), respectively, over all frames of each test sequence. In these two tables, the best PSNR/SSIM improvement is highlighted in bold and the second-best is underlined. We can see that our proposed baseline and MICNN outperform the other methods in most cases. Meanwhile, the proposed MICNN achieves better performance than our single-input baseline model, which demonstrates the benefit of the proposed SCC mode masks.
When the QP is 37, the highest PSNR improvement of our MICNN reaches 1.20 dB, on the sequence scwebbrowsing. The average ΔPSNR of our MICNN is 0.58 dB, which is 0.03 dB higher than that of our baseline model (0.55 dB), 0.41 dB higher than QECF (0.17 dB), 0.18 dB higher than Partition-aware CNN (0.40 dB), 0.14 dB higher than DCAD (0.44 dB), and 0.27 dB higher than QECNN (0.31 dB). It is noted that QECF includes specific designs for enhancing gaming content. However, our proposed method can also handle gaming content and text content: compared with QECF, MICNN achieves an acceptable PSNR improvement (0.14 dB) and SSIM improvement (0.0012) on the gaming content sequence scrobot and outperforms QECF on the other sequences. In addition, the per-frame PSNR curves of three test videos for DCAD, QECNN, Partition-aware CNN, QECF, and our proposed MICNN are shown in Fig. 10. The scdesktop sequence is mixed content, while scwebbrowsing and scflyingGraphics are pure screen content. By utilizing the proposed binary mode masks, MICNN achieves the highest PSNR in every frame across the different contents, which shows that our proposed method is robust.
The BD-rate [22] is used to indicate the bitrate savings of these models at equivalent PSNR. Experimental results are compared and tabulated in Table 8. Our proposed MICNN achieves higher BD-rate savings than its corresponding baseline, which again demonstrates the effectiveness of the mode masks. MICNN obtains an average BD-rate saving of 3.41%, while the second-best method achieves only 2.97%. For the test sequence scSlideShow, up to 6.76% BD-rate saving is obtained for the Y component under the AI configuration. We conjecture that MICNN well exploits the mode information to further enhance the decoded frame quality and reduce the BD-rate.
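The Bjøntegaard delta bit-rate of [22] can be computed from two four-point RD curves as sketched below. The `bd_rate` helper is an illustrative implementation of the standard recipe (cubic fit of log-rate against PSNR, integrated over the overlapping PSNR range); negative values mean bit-rate savings.

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta bit-rate in percent: fit cubic polynomials of
    log-rate vs. PSNR for both RD curves and average the gap over the
    overlapping PSNR range."""
    lr_a, lr_t = np.log(rates_anchor), np.log(rates_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

For example, a test codec that reaches the same PSNR at half the bit-rate of the anchor yields a BD-rate of -50%.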

2) SUBJECTIVE VISUAL QUALITY COMPARISON
This section compares the subjective quality of the different models. Fig. 11 shows the subjective visual quality of various models on the sequences scSlideShow, scprogramming, and scflyingGraphics at QP = 37. From this figure, we can see that the reconstructed frames of HM16.20-SCM8.8 have obvious compression artifacts, which cannot be completely removed by DCAD, QECNN, or QECF. As shown in Fig. 11, our MICNN eliminates the artifacts more effectively than the other models. For scSlideShow and scprogramming, the characters are blurry and there are blocking artifacts in the background around them, but they become clearer after being processed by our proposed MICNN. For scflyingGraphics, the lines are blurry in the reconstructed frame but become sharper with MICNN; in addition, the flat areas around the lines contain many artifacts, which MICNN can smooth. All these examples in Fig. 11 show that MICNN is superior to the other models in terms of subjective visual quality. There are no uneven regions at the CU boundaries and no visible blocking effect in the frames processed by MICNN. This again shows that our MICNN can make use of the mode information to enhance the decoded frame quality subjectively.

3) QUALITY ENHANCEMENT AT VARIOUS QPs
To verify the generalization ability of the MICNN model across various QPs, we additionally encode all test sequences at QPs of 24, 29, 34, and 39 and evaluate them with the models trained at QP = 22, 27, 32, and 37, respectively. The performance in terms of PSNR is shown in Fig. 12. Fig. 12(a) shows the PSNR improvement of the model trained at QP = 22 and tested at QP = 22 and 24. In Fig. 12(b), the model is trained at QP = 27 and tested at QP = 27 and 29. Similarly, Fig. 12(c) and Fig. 12(d) show the PSNR of the models trained at QP = 32 and 37 and tested at QP = 32 and 34, and QP = 37 and 39, respectively. As shown in this figure, each trained model obtains acceptable quality enhancement on videos decoded at adjacent QPs, which verifies the generalization ability across various QPs.

4) COMPARISONS ON COMPUTATIONAL COMPLEXITY IN DECODER
To evaluate the computational complexity of the various models, we follow the measurement metric of other post-processing algorithms [11], [15] by computing the running time per Coding Tree Unit (CTU) at the decoder side. Experiments were conducted using an Intel i9-10900K CPU, 64 GB RAM, and NVIDIA 3090 Ti GPUs. Fig. 13 shows the average PSNR improvement against the running time per CTU for MICNN, DCAD [9], QECNN [13], QECF [14], and Partition-aware CNN [15]. The results in this figure are averaged over all the test sequences. In Fig. 13, the running times of DCAD, QECNN, QECF, and Partition-aware CNN are 0.40 ms, 0.66 ms, 0.91 ms, and 2.20 ms per CTU, respectively. Our proposed MICNN consumes approximately 1.08 ms per CTU but achieves the highest PSNR improvement among the models. In Table 9, we also compare the overall time consumed to enhance one frame at different resolutions for the different methods. From Table 9 and Table 10, we observe that the performance improvement of MICNN comes at a reasonable computational cost compared to QECNN and DCAD. Moreover, MICNN outperforms Partition-aware CNN in both running time and PSNR.
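The per-CTU complexity metric can be sketched as follows. The `time_per_ctu` helper is an illustrative assumption: it divides the wall-clock time of one enhancement pass by the number of 64 × 64 CTUs covering the frame.

```python
import math
import time

def time_per_ctu(enhance_fn, frame, height, width, ctu=64):
    """Average enhancement time per CTU in milliseconds, the
    decoder-side complexity metric used by post-processing methods."""
    n_ctu = math.ceil(height / ctu) * math.ceil(width / ctu)
    t0 = time.perf_counter()
    enhance_fn(frame)
    elapsed = time.perf_counter() - t0
    return elapsed / n_ctu * 1000.0  # ms per CTU
```

Normalizing by CTU count makes timings comparable across sequences of different resolutions.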

5) MODEL SIZE
Model complexity in terms of model size is also evaluated in Table 10; the model size reflects the number of network parameters. Compared to our baseline model, MICNN adds sub-branches to improve performance without significantly increasing the model size. Moreover, the proposed MICNN achieves higher performance than Partition-aware CNN with a smaller model size. It can be concluded that MICNN obtains a better tradeoff between coding efficiency and model size; in other words, MICNN is more model-efficient.

VI. CONCLUSION
By integrating the proposed binary mode masks into a mode information guided deep network model, the SCC modes extracted from the bitstream can be utilized to further improve SC video quality. Specifically, the new branches use the binary mode masks, which are based on the coding modes of SCC, to exploit the characteristics of SCC and then guide the neural network for quality enhancement of screen content videos. This is the first work to incorporate SCC mode information into sub-branches for enhancing SC quality. Experimental results show that our proposed MICNN is more effective than other networks. We believe that our mask branches can be easily adopted into different single-input models for further quality enhancement of SCC. In the future, we will work toward a real-time model, which is essential for the further development of real-time applications.

VOLUME 11, 2023