Offset-Based In-Loop Filtering With a Deep Network in HEVC

With the great flexibility and performance of deep learning technology, there have been many attempts to replace existing functions inside video codecs such as High-Efficiency Video Coding (HEVC) with deep-learning-based solutions. One of the most researched approaches is adopting a deep network as an image restoration filter to recover distorted compressed frames. In this paper, instead, we introduce a novel idea for using a deep network, in which it chooses and transmits side information according to the type of errors and contents, inspired by the sample adaptive offset filter in HEVC. One part of the network computes the optimal offset values while another part simultaneously estimates the type of error and contents. The combination of the two subnetworks can estimate highly nonlinear and complicated errors better than conventional deep-learning-based schemes. Experimental results show that the proposed system yields average bit-rate savings of 4.2% and 2.8% for the low-delay P and random access modes, respectively, compared to conventional HEVC. Moreover, the performance improvement reaches 6.3% and 3.9% for higher-resolution sequences.


I. INTRODUCTION
With the continuous development of electronic devices such as smartphones and digital TVs, video plays increasingly important roles in our lives. At the same time, because of the rapid development of network technology, the amount of video traffic in the network is rapidly increasing. Besides an increase in the number of videos, the video resolution is becoming larger because of the increased size of the display devices. Therefore, there is heightened demand for video compression technology that maintains high quality with fewer bits.
At this writing, the most recent standard video codec is High Efficiency Video Coding (HEVC) [1]. Although its encoding and decoding processes are similar to those of previous standard codecs, such as H.264/AVC [2], it is much more efficient and faster because of its advanced algorithms and parallel processing. In detail, at the same video resolution, HEVC requires 40-50% fewer bits than H.264/AVC to yield similar-quality video. It is composed of dozens of modules, such as motion compensation, discrete cosine transform (DCT), quantization, in-loop filtering, and context-adaptive binary arithmetic coding (CABAC) [3]. Among them, we focused on in-loop filtering in this study. The in-loop filter is designed to reduce compression artifacts caused by block-wise quantization and, at the same time, to achieve bit-rate savings by restoring a damaged frame as close as possible to the original. It consists of a deblocking filter (DF) [4] and a sample adaptive offset (SAO) filter [5]. DF reduces blocking artifacts caused by block-wise processing of the coding unit (CU). On the other hand, SAO reduces ringing artifacts caused by the loss of high-frequency components during quantization.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Recently, deep-learning approaches have attracted much research attention [6]. They have exhibited superior performance in many areas, including conventional computer vision tasks such as classification [7], and conventional image processing tasks such as image restoration and super resolution [8], [9]. As an early work in image restoration with deep learning, the super-resolution convolutional neural network (SRCNN) [8] used a simple structure, but its results were impressive. The very deep super resolution convolutional network (VDSR) greatly improved the performance by using a deep residual network architecture [9]. Inspired by the success of SRCNN and VDSR, many researchers have attempted to apply deep-learning techniques to reduce compression artifacts. As a simple modification of SRCNN, Yu et al. first proposed a network for the removal of still image compression artifacts, called the artifacts reduction convolutional neural network (ARCNN) [10]. ARCNN directly targets JPEG [11] artifact removal, but its structure is similar to that of SRCNN. To further improve the performance, the deep dual-domain convolutional neural network (DDCN) [12] was proposed to handle the characteristics of DCT for JPEG using two separate networks for the pixel and DCT domains.
Likewise, it is believed that deep-learning approaches can be a good solution to reduce video compression artifacts, and researchers have attempted to improve video compression performance by applying deep learning in various ways. For example, a deep-learning-based network can replace conventional modules or functions, such as intra prediction, motion estimation, and in-loop filtering.
As mentioned previously, we mainly focused on in-loop filtering in this study. We propose a new network that replaces the in-loop filter, in particular the SAO of HEVC, in both the encoder and the decoder. Unlike previous deep-learning-based in-loop filtering methods that only recover the distorted frames, the proposed scheme computes and transmits offsets as side information, which significantly improves the prediction accuracy of filtering at the slight cost of additional bits.
The rest of this paper is organized as follows. We review related works in Section II. Section III describes the motivation and the details of the proposed method. Section IV describes how we trained the network and the data we used for training, and then discusses the results of the trained network and their analysis. Finally, we conclude this paper in Section V.

II. RELATED WORKS
There are two ways to apply filter-based methods in HEVC to improve coding efficiency. One is applying the filter to decompressed frames to recover the image as close as possible to the original. The coding artifact removal method belongs to this category. The other is applying the filter to intermediate reconstructed frames. Both methods attempt to recover the reconstructed image, but there is a major difference between them. The latter uses the filtered image as a reference for the prediction of other frames, whereas the former does not. Therefore, the former is called post-processing, and the latter is called in-loop filtering. For example, both DF and SAO in HEVC are in-loop filtering methods.

A. POST-PROCESSING
Many researchers have been studying the use of deep-learning technology as post-processing first, because this approach does not require modification of the encoder or decoder.
In addition, post-processing for the coded image is a kind of image restoration problem. It enables researchers to easily apply a well-known deep-learning network to the codec. Inspired by SRCNN, CNN-based filtering [13] was first proposed for video compression by simply using three convolution layers to improve coding efficiency. It exhibits considerable improvement in All Intra (AI) mode, but the experiments were conducted only with low-resolution video. Moreover, their results for Low Delay P (LDP) and Random Access (RA) modes were unsatisfactory. Variable-filter-size residual-learning CNN (VRCNN) [14] suggested a more complicated network, which uses various filter sizes to match various CU sizes in HEVC. However, it is still effective on AI mode only.
To apply the deep-network approach to inter frames, such as P- and B-frames, the decoder-side scalable convolutional neural network (DSCNN) [15] was proposed; it uses different architectures for intra- and inter-coded frames. The network for intra frames, called DSCNN-I, consists of five simple convolution layers, and the network for inter frames, called DSCNN-B, also consists of five convolution layers. However, DSCNN-B uses additional inputs at each layer, and those are from DSCNN-I. This means that the output of the previous layer and the output from the corresponding layer of DSCNN-I are concatenated to form the input of the present layer. It mimics the conventional video codec, where an inter frame uses nearby intra frames as its reference.
Unlike those networks with complicated architectures, two networks with simpler architectures based on VDSR were proposed. Li et al. [16] used only 20 convolutional layers, which led to an average BD-rate reduction of 1.6% compared with the HEVC baseline in the 2017 ICIP Grand Challenge. The deep convolutional neural network-based auto decoder (DCAD) [17] uses only ten convolution layers, but its results are much better than those of the others because of its use of a residual block structure [6]. It also shows good results not only for AI mode, but also for LDP and RA modes.

B. IN-LOOP FILTERING
In-loop filtering itself is not different from post-processing, but its effect and analysis are more complicated because it affects not only the current frame, but also the other frames that refer to the current frame. The spatial-temporal residue network (STResNet) [18] was proposed as an in-loop filter, located after SAO. It uses both spatial and temporal information during network training to make it work for intra- and inter-compressed frames, but the results showed only a small improvement. Jia et al. [19] proposed a content-aware CNN with multiple networks. They classified training samples with a clustering method, and trained a network for each set of clustered training samples. When the network is incorporated into HEVC, an additional symbol must be sent to indicate which network is used. This process does nothing but train each network according to the characteristics of each sequence. The multi-layer deep convolutional neural network (MDCNN) [20] uses convolution layers and a symmetric deconvolution layer, replacing SAO. However, it works only for AI mode, and it uses the same offsets made by SAO.
The residual highway convolutional neural network (RHNet) [21] was designed as an in-loop filter, following DF and SAO. The network consists of highway units, each of which contains three layers and an identity skip connection as a shortcut. The dense residual convolutional neural network (DRN) [22] was also proposed as an in-loop filter. It consists of dense residual units (DRUs), in which both ResNet [6] and DenseNet [23] features are used. It is embedded between DF and SAO in HEVC. The progressive rethinking network (PRN) [24] is composed of the progressive rethinking block (PRB) and the side information feature extractor (SIFE). PRB aims to keep high-dimensional representative features, and SIFE aims to extract the multi-scale mean value of the CU as side information to improve the network.
However, the in-loop filtering methods mentioned so far exhibit smaller performance improvements than post-processing methods, and their usage is often limited. One reason is that state-of-the-art image restoration network designs are well suited to post-processing, whereas simply adopting them for in-loop filtering is inefficient. Instead, a design tailored to in-loop filtering is required, and this is the starting point of our proposed scheme.

III. PROPOSED METHOD
A. RESEARCH MOTIVATION
As stated previously, HEVC has two in-loop filters, DF and SAO, as in Fig. 1(a). Both filters are adopted to remove compression noise, but each targets different types of noise. An additional difference between them is the filtering scheme for noise removal. DF is turned on when pre-defined block boundary conditions are satisfied, and the filter shape is determined according to the type of boundary condition. On the other hand, SAO uses the explicit information of four offsets. They are computed at the encoder and then sent to the decoder with CABAC [3]. In other words, SAO uses and sends side information to improve the estimation accuracy, and it is turned on selectively when the gain in estimation accuracy outweighs the cost of the additional bits. This is determined by the well-designed rate-distortion optimizer in HEVC.
Inspired by the overall concept of SAO, we attempted to design the network to generate and use the side information as in Fig. 1(b). The major contribution of this study is the preparation and use of offsets as side information using two deep networks. To the best of our knowledge, it is the first approach to generate and use the side information with an end-to-end deep network. Details of the network design and its application to HEVC are described in the following subsections.

B. NETWORK ARCHITECTURE
Because the proposed deep network is inspired by SAO, it is necessary to analyze SAO in detail first. SAO is broadly composed of two parts: 1) the classification of the error type for each pixel position, and 2) the computation of the optimal offsets according to the type of error. The error classification and offset computation are conducted for every coding tree unit (CTU) independently. In the first part, the type of error can be estimated in various ways. For example, SAO can investigate neighboring signal values and analyze the edge shape; this is called the SAO-edge offset (EO). Alternatively, the type of error can be classified according to the pixel intensities; this is called the SAO-band offset (BO). Once the classification is done, the second part computes the optimal offsets for each class. Because it is complicated to determine the optimal values in a rate-distortion sense, SAO simply considers the distortion, i.e., it assigns the average error value of each class as its offset. Here, it is notable that SAO acts in different ways in the encoder and the decoder. The encoder contains both the classification and the offset computation parts of SAO, whereas the decoder includes only the classification module because the optimal offsets are transmitted from the encoder.
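The two SAO-EO steps described above can be illustrated with a minimal NumPy sketch. It is not the HM implementation: it handles only the horizontal direction, ignores offset clipping and quantization, and the function names `eo_categories_horizontal` and `mean_error_offsets` as well as the sample values are hypothetical.

```python
import numpy as np


def eo_categories_horizontal(rec):
    """Classify interior pixels into SAO edge-offset categories 0..4
    using the left (a) and right (b) neighbours of each sample (c)."""
    rec = rec.astype(np.int32)
    c = rec[:, 1:-1]
    a = rec[:, :-2]
    b = rec[:, 2:]
    # sign(c - a) + sign(c - b): -2 is a local minimum, +2 a local maximum,
    # -1 and +1 are concave and convex corners, 0 means no edge class.
    s = np.sign(c - a) + np.sign(c - b)
    cat = np.zeros_like(rec)
    for key, value in {-2: 1, -1: 2, 1: 3, 2: 4}.items():
        cat[:, 1:-1][s == key] = value
    return cat


def mean_error_offsets(orig, rec, cat, num_classes=4):
    """Offset of each class = average (orig - rec) over the pixels in it."""
    err = orig.astype(np.float64) - rec.astype(np.float64)
    offsets = np.zeros(num_classes)
    for k in range(1, num_classes + 1):
        mask = cat == k
        if mask.any():
            offsets[k - 1] = err[mask].mean()
    return offsets


# Hypothetical one-row example with ringing around a flat signal.
rec = np.array([[10, 8, 10, 10, 12, 10]])
orig = np.array([[10, 9, 10, 10, 11, 10]])
cat = eo_categories_horizontal(rec)
offsets = mean_error_offsets(orig, rec, cat)
print(cat)       # categories per pixel: 0 1 3 2 4 0
print(offsets)   # per-class offsets: 1, 0, 0, -1
```

In this toy case the local minimum is raised by one and the local maximum is lowered by one, which is the smoothing behaviour that SAO-EO is designed to produce.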
Likewise, we decompose SAO into two parts, and propose two separate deep networks. Each network is designed to fully mimic the concept of SAO. The first deep network estimates the type of error. Because SAO-EO checks the shape of the image signal along a line for the classification, the proposed error classification network (ECN) classifies the type of error signal according to neighboring image intensity values. Because ECN is also embedded at the decoder side, only the data available during decoding are used for the classification, as in SAO. To be more specific, it takes the reconstructed image after DF as input and passes it through eight residual blocks (R_1, R_2, ..., R_8), as shown in the upper part of Fig. 2(a). In this sense, the proposed ECN can be regarded as a generalized version of SAO-EO, because the simple line-based shape estimation along a specific direction is extended to cover arbitrarily complicated relationships between the current and neighboring pixels. Moreover, it can also be regarded as including the concept of SAO-BO, because the pixel intensity is also considered in ECN.
The second deep network, called the offset estimation network (OEN), is designed to calculate the offsets using a series of pooling blocks. A previously proposed approach uses networks with offsets [19]. However, it uses the original offsets given by the original SAO, which strongly restricts the performance improvement obtainable in combination with an error-type prediction network such as ECN. Instead, we propose that the network calculates the offsets from the reconstructed image after DF and the original image. OEN then estimates the optimal offset values through pooling blocks to yield four offsets, where the number of offsets follows SAO-EO. In addition, as in SAO, the offsets are constrained to always have positive values to save the bit rate of the four offsets; the signs of the offsets are instead estimated by ECN. It is notable that the original image can be used in OEN because this network is embedded only at the encoder, whereas the four offsets are coded by CABAC and transmitted to the decoder.
The two proposed networks, ECN and OEN, behave like their counterparts in SAO. However, there is one major difference: the order of the two modules. In SAO, the offset computation is applied after the error-type prediction is done, because the second module assumes that the suggestions of the first module are correct. However, this is not trivial for a learning-based approach, because the optimal (or correct) suggestion is not available in practice. When propagating errors, for example, it is not certain whether the first or the second module is responsible for the error. In the proposed architecture, we instead connect the two networks in parallel, as in Fig. 2(a), and assign a penalty to both networks during error back-propagation. Then, we need not consider intermediate ground-truth information.
Finally, the outputs of the two networks are combined at the last stage of the given system. The ECN output w is a tensor of size H × W × 4, where H and W are the height and width of the given input image, respectively. The OEN output O is a 4 × 1 column vector containing the four offsets. Then, the estimated error e at (i, j) is obtained by
$$e(i, j) = \sum_{k=1}^{4} w(i, j, k)\, O(k).$$
If the output values of ECN are only one of ±1 or zero, then it will select one of the four offsets with a sign, exactly as in SAO.
The proposed method even improves this part by softening the decision for ECN to yield a value ranging from −1 to +1. This means that it will selectively use each of the four positive offsets according to its probability.
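The combination of the two outputs can be sketched in NumPy under the assumption that the estimated error at each pixel is the sum of the four offsets weighted by the ECN output at that pixel. The function name `combine` and all numeric values are hypothetical; the sketch only contrasts the hard (SAO-like) and soft decisions.

```python
import numpy as np


def combine(w, offsets):
    """Per-pixel estimated error: e[i, j] = sum_k w[i, j, k] * offsets[k]."""
    return np.einsum("hwk,k->hw", w, offsets)


offsets = np.array([1.0, 2.0, 3.0, 4.0])   # four positive offsets, as from OEN

# Hard decision: each pixel picks at most one offset, with a sign.
w_hard = np.zeros((1, 2, 4))
w_hard[0, 0, 1] = 1.0     # pixel (0, 0) adds the second offset
w_hard[0, 1, 3] = -1.0    # pixel (0, 1) subtracts the fourth offset
e_hard = combine(w_hard, offsets)
print(e_hard)             # values 2 and -4

# Soft decision: weights anywhere in [-1, 1] blend several offsets.
w_soft = np.zeros((1, 1, 4))
w_soft[0, 0] = [0.5, 0.0, -0.5, 0.0]
e_soft = combine(w_soft, offsets)
print(e_soft)             # 0.5 * 1 - 0.5 * 3 = -1
```

With hard weights the result reduces to a signed selection of a single offset, whereas soft weights let one pixel receive a correction that none of the four offsets provides alone.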
To train these networks, the conventional L2 loss function between the reconstructed CTU and its ground truth is used. Moreover, we account for the use of offsets as additional information by counting the number of transmitted bits. Therefore, the final loss function is
$$L = D + \lambda R,$$
where D indicates the distortion cost given by the L2 function, and λ is the Lagrange multiplier used in HEVC. R measures the rate cost by counting the number of bits used for each offset value, assuming a unary code for offset coding.
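A rate-distortion loss of this kind can be evaluated numerically as in the following sketch. Several details are assumptions rather than the authors' definitions: the loss is taken as D + λR, D is the summed squared error, offsets are rounded to integers, and a unary code for a non-negative integer n is assumed to cost n + 1 bits (n ones and a terminating zero). The names `unary_bits` and `rd_loss` are hypothetical.

```python
def unary_bits(value):
    """Bits for a non-negative integer under a plain unary code:
    `value` ones followed by a terminating zero (assumed convention)."""
    if value < 0:
        raise ValueError("unary code is defined for non-negative integers")
    return value + 1


def rd_loss(orig, recon, offsets, lam):
    """Loss = D + lam * R, with D the summed squared error over the block."""
    distortion = sum((o - r) ** 2 for o, r in zip(orig, recon))
    rate = sum(unary_bits(int(round(k))) for k in offsets)
    return distortion + lam * rate


# Hypothetical four-sample block and four offsets.
orig = [10, 12, 9, 11]
recon = [10, 11, 9, 13]
loss = rd_loss(orig, recon, offsets=[1, 0, 2, 0], lam=0.5)
print(loss)   # D = 5, R = 7 bits, so 5 + 0.5 * 7 = 8.5
```

The rounding step is not differentiable, so a training implementation would need a smooth surrogate for the rate term; the sketch only shows how the two terms trade off.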

C. SYNTAX
It is necessary to adjust the syntax design for SAO, because the proposed network is slightly different from the conventional SAO. The syntax structure of the original SAO is shown in Fig. 3(a). As described previously, the proposed networks mimic SAO, especially SAO-EO, but they have a slightly different syntax structure. Above all, they do not need a symbol for direction, because the type of error is classified by considering all directions simultaneously, or in even more complicated ways. Moreover, ECN also includes the concept of SAO-BO, which means that the one-bit indicator for SAO-EO/BO is unnecessary. These two points save bits for the syntax information. The summarized syntax structure of the proposed scheme is shown in Fig. 3(b).

D. BLOCKS
The proposed network uses two kinds of block structure. The first is the residual block, which consists of two convolution layers and one ReLU, as seen in Fig. 4(a). In addition, it uses the residual learning method to facilitate learning. In the proposed scheme, the kernel size of all convolution layers is set to 3 × 3.
The second is the pooling block, which consists of two convolution layers, a ReLU, and a max pooling layer. With the pooling layer, we can compress input data efficiently. In this part, the kernel size of the convolution layers is also set to 3 × 3. As a result, the two proposed networks have relatively simple structures, but they work quite well compared to other networks with complicated structures [13], [14], [18], [19]. For more details, please refer to the source code: https://github.com/yym064/DeepSAO.
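The two block types can be sketched in PyTorch as follows. This is an illustration under assumptions, not the released implementation: the layer order (convolution, ReLU, convolution), the 2 × 2 pooling window, the padding, and the channel width of 16 are all guesses, and the class names `ResidualBlock` and `PoolingBlock` are hypothetical.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """conv3x3 -> ReLU -> conv3x3, plus an identity skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class PoolingBlock(nn.Module):
    """conv3x3 -> ReLU -> conv3x3 -> 2x2 max pooling (order assumed)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, x):
        return self.body(x)


# One 64x64 CTU with 16 feature channels (channel count assumed).
x = torch.randn(1, 16, 64, 64)
res_out = ResidualBlock(16)(x)
pool_out = PoolingBlock(16, 16)(x)
print(res_out.shape, pool_out.shape)
```

The residual block preserves the spatial size, which suits a per-pixel output such as that of ECN, while each pooling block halves the height and width, so a stack of them can reduce a CTU to a handful of values such as the four offsets of OEN.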

IV. EXPERIMENTS
A. EVALUATION MEASURES
For the evaluation of the proposed method, we chose the commonly used MPEG sequences with class A, B, C, D, E, and UHD resolution, where the first 50 frames of each sequence are tested [25]. We tested on LDP and RA modes, and used four quantization parameter (QP) values (22, 27, 32, and 37), which is a common selection. To see and compare the objective results, the Bjøntegaard delta bit-rate (BDBR) [26] was calculated, which shows the average bit-rate saving with the same quality.
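For readers unfamiliar with BDBR, the following NumPy sketch shows the commonly used cubic-polynomial variant of the Bjøntegaard calculation: log-rate is fitted as a function of PSNR for each codec, and the average gap between the two fits over the shared PSNR range is converted to a percentage. The function name `bd_rate` and all rate and PSNR values are hypothetical, and reference tools may differ in interpolation details.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate change (%) of test vs anchor at equal PSNR.
    Negative values mean the test codec saves bits."""
    log_ra = np.log10(rate_anchor)
    log_rt = np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    poly_a = np.polyfit(psnr_anchor, log_ra, 3)
    poly_t = np.polyfit(psnr_test, log_rt, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Hypothetical curves with one point per QP (37, 32, 27, 22).
psnr = [32.0, 35.0, 38.0, 41.0]
rate_anchor = [1000.0, 2000.0, 4000.0, 8000.0]
rate_test = [r * 0.95 for r in rate_anchor]   # 5% fewer bits at the same PSNR
result = bd_rate(rate_anchor, psnr, rate_test, psnr)
print(round(result, 2))   # close to -5.0
```

Four rate points per curve are exactly enough to determine a cubic, which is one reason the four-QP test condition is so common.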
For a fair comparison, we chose VRCNN [14], DCAD [17], DRN [22], and RHNet [21] as competitors. VRCNN and DCAD were originally proposed as post-processing methods, and RHNet and DRN were proposed as in-loop filters. The methods are applied at slightly different positions, and therefore we followed the training and testing procedures suggested in the original works. In detail, VRCNN replaces both DF and SAO, and its training samples were prepared with DF and SAO turned off. On the other hand, RHNet, DCAD, and DRN are applied with both DF and SAO turned on. Both RHNet and DCAD are applied after SAO, while DRN is positioned between DF and SAO. The proposed method replaces only SAO, and therefore the training samples were obtained with only SAO turned off. Originally, RHNet and DRN were tested as in-loop filters, but they could not yield any positive gain in the common HEVC configurations. We believe the experimental settings in their papers were outdated. Therefore, we use them as post-processing filters for comparison.

B. NETWORK TRAINING
For a fair comparison, all compared methods were trained on the same training set with the same HEVC configuration and on the same platform. We used 37 video sequences [27]-[29] with 1920 × 1080 HD resolution to train the network. These 37 sequences were selected from outside the MPEG test set; the MPEG sequences were used for testing only. We used the second through eleventh frames of each sequence for training because the first frame of an LDP- or RA-compressed video is an intra-compressed frame. We trained each network for LDP and RA modes, each with I-, P-, and B-frames.
The network was implemented with PyTorch [30], and HM-16.9 [31] was used as the HEVC reference software in our experiments. All training sequences were compressed separately for each QP. The loss was minimized using the Adam optimizer [32] during back-propagation. The learning rate was set to 1e-6, and the batch size was 4. To prepare the training set, the coded data without SAO were used as the network input. For convenient implementation, we applied the proposed network to 64 × 64 luminance CTUs only. However, the resulting performance degradation is negligible.
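The stated optimizer settings can be sketched as a single PyTorch training step. Only the Adam optimizer, the learning rate of 1e-6, the batch size of 4, and the 64 × 64 luminance input come from the text; the three-layer placeholder model, the plain MSE loss without a rate term, and the random tensors standing in for real data are assumptions made for illustration.

```python
import torch
from torch import nn

# Placeholder model standing in for the real network (hypothetical).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
criterion = nn.MSELoss()

# One batch of four 64x64 luma CTUs; random tensors replace real data here.
decoded = torch.rand(4, 1, 64, 64)    # stands for DF output with SAO off
original = torch.rand(4, 1, 64, 64)   # stands for the uncompressed frame

optimizer.zero_grad()
loss = criterion(model(decoded), original)
loss.backward()
optimizer.step()
print(float(loss))
```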

C. PERFORMANCE EVALUATION
The full comparative results are shown in Table 1. First, the results show that the proposed method outperforms all other schemes overall, and by a large margin in LDP mode. The other methods sometimes perform well, but they sometimes yield a negative gain. In contrast, the proposed scheme shows not only the best performance, but also stable performance for all configurations and sequences. This means that the proposed method predicts both intra-coded and inter-coded distortions well, and we conclude that the highly complicated behavior of the error can be predicted considerably well by the offsets.
Another interesting point in the results is that the proposed scheme shows better performance for UHD and class A, B, E, i.e., high-resolution sequences, which have greater relevance. For better readability, the average performance for high-and low-resolution sequences is shown separately in Table 1.
It is observed that all methods show greater performance improvement for high-resolution sequences. One reason is that the network was trained on HD sequences, whose resolution matches Class B. Furthermore, we found that larger-resolution content is generally less complicated within a fixed 64 × 64 block. In other words, such a block includes fewer high-frequency components because of the enlargement of the image. Therefore, it is relatively easier to predict the types of error. It is worth noting that the lower efficiency for smaller-resolution content is also observed in the conventional SAO. However, the results for UHD are worse than those for Class A, which is because the training samples were prepared at the Class B resolution. If the training data can be prepared with UHD content, then the performance is expected to be much higher. Unfortunately, we could not find suitable UHD content for training, and this remains as future work.
For further analysis, performance comparisons with R-D curves are shown in Fig. 5. The proposed method outperforms the HEVC anchor at all rate points, and it is also interesting to see a larger performance increase at higher bit-rates. In addition to the R-D curves, visual quality comparisons are provided in Fig. 6. We selected a frame for which the proposed method used fewer bits than HEVC but achieved a higher PSNR. The visual quality improvement is also clearly observed compared to other methods.
Finally, we analyzed how the proposed method changes the ratio of SAO modes by counting the number of CTUs coded in SAO-New, SAO-Merge, or SAO-Off mode for the class A and B datasets in Fig. 7. In general, the ratio of SAO-On modes (New/Merge) decreases as QP increases for both the conventional and proposed methods. However, the ratio of SAO-New mode largely increases for lower QP in class A, which gives a large performance improvement. Our interpretation is that the residual signal contains more information at lower QP, and the proposed deep network is capable of predicting its complex pattern, while the conventional SAO in HEVC is not. On the contrary, the proposed network reduces the number of SAO-New modes for higher QP or lower resolution. This suggests that the network has been trained to target highly complicated patterns, abandoning other patterns with small gains.

D. COMPLEXITY ANALYSIS
In this section, we compare the proposed method to other learning-based methods in terms of model complexity. For a fair comparison, we executed all the algorithms on an NVIDIA GeForce RTX 2080 Ti GPU. Table 2 lists the number of parameters, the number of floating point operations (FLOPs), and the average execution times for class B sequences. Because the proposed method is selectively applied to each CTU, like the conventional SAO, the runtime depends strongly on the QP value. As can be seen from Table 2, the proposed method has fewer parameters and operations than RHNet, but more than DCAD and DRN. Therefore, the proposed method takes the second-longest time on average among the compared methods. The total execution time is greatly reduced for large QP values, because only about 1-2% of CTUs use ECN during decoding.
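As a rough guide to how such parameter and operation counts arise, the sketch below counts the learnable parameters and multiply-accumulate operations of 3 × 3 convolution layers. The 64-channel width and the helper names `conv_params` and `conv_macs` are hypothetical and are not taken from Table 2; FLOPs are often reported as roughly twice the multiply-accumulate count.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Learnable parameters of one 2-D convolution layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def conv_macs(in_ch, out_ch, height, width, kernel=3):
    """Multiply-accumulates of a stride-1, same-padded convolution."""
    return kernel * kernel * in_ch * out_ch * height * width


# Hypothetical residual block: two 3x3 convolutions with 64 channels,
# evaluated on one 64x64 CTU.
params = 2 * conv_params(64, 64)
macs = 2 * conv_macs(64, 64, 64, 64)
print(params, macs)   # 73856 parameters, 301989888 multiply-accumulates
```

Because the operation count scales with the number of filtered CTUs, skipping the network for CTUs in SAO-Off mode reduces the runtime proportionally.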

V. CONCLUSION
In this paper, we proposed a novel in-loop filter using deep learning with side information to replace SAO in HEVC. The proposed network designs are highly motivated by SAO-EO in HEVC; that is, one network classifies the types of error according to the edge shape of the reconstructed signal, and the other network simultaneously predicts the optimal offset values. These offsets are then transmitted to the decoder to strongly improve the estimation accuracy. Furthermore, we proposed a modified compact syntax design. The proposed method achieved average bit-rate savings of 4.2% in LDP mode and 2.8% in RA mode, which outperforms other deep-learning-based in-loop or post-filter schemes.