Neural Network-Based Video Compression Artifact Reduction Using Temporal Correlation and Sparsity Prior Predictions

Quantization in lossy video compression may incur severe quality degradation, especially at low bit-rates. Developing post-processing methods that improve the visual quality of decoded images is of great importance, as they can be directly incorporated into any existing compression standard or paradigm. In this article, we propose a two-stage method for video compression artifact reduction: a texture detail restoration stage followed by a deep convolutional neural network (CNN) fusion stage. The first stage operates in a patch-by-patch manner. For each patch in the current decoded frame, one prediction is formed based on the sparsity prior, which assumes that natural image patches can be represented by sparse activations of dictionary atoms. Under the temporal correlation hypothesis, we search for the best matching patch in each reference frame and select several matches with more texture details to tile motion compensated predictions. The second stage stacks the predictions obtained in the preceding stage, along with the decoded frame itself, to form a tensor, and uses a deep CNN to learn the mapping from this tensor to the original uncompressed image. Experimental results demonstrate that the proposed two-stage method can remarkably improve, both subjectively and objectively, the quality of the compressed video sequence.


I. INTRODUCTION
Quantization in lossy image and video compression is a many-to-one mapping. This means that the decoded block can be quite different from the original one, especially at low bit-rates. Developing post-processing methods that improve the visual quality of decoded images at the decoder side has attracted great interest from researchers, as such methods can be directly incorporated into any existing compression standard or paradigm. Most existing methods of this kind can be classified into three categories. Methods in the first class are deblocking oriented: blocking artifacts are regarded as artificial discontinuities around block boundaries, and these annoying boundary pixels are smoothed out with linear or non-linear filters in the spatial or frequency domain [1], [2]. Methods in the second class are restoration oriented. They regard image compression as distortion, and the restoration from a decoded image is usually formulated as an ill-posed inverse problem, which is typically solved by exploiting image model priors, including projections onto convex sets [3], [4], block-based sparse representation [5]-[11], total variation [12], and Markov random fields [13], [14]. Recently, inspired by the great success of deep learning in other computer vision tasks such as single image super-resolution [15]-[17] and image denoising [18], researchers began to use deep convolutional neural networks (CNNs) for compression artifact reduction [19]-[25], leading to the third class, the CNN-based methods. Using a deep neural network requires training a mapping function that estimates, for a given decoded input, its corresponding original uncompressed counterpart.
The associate editor coordinating the review of this manuscript and approving it for publication was Junxiu Liu. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Works originally developed for JPEG images typically use a single image as the input [19]-[21]. Direct extension to the video domain is possible in a frame-by-frame manner. However, the quality of a compressed video can fluctuate dramatically across frames [24]. In particular, for low-quality frames, although the trained network may improve their visual quality to some degree, a large number of visually noticeable artifacts typically remain. Therefore, works developed for image sequences typically use multiple frames as the input to the CNN [23], [24], [26]. With the help of its neighboring high-quality frames, a low-quality frame can be adequately enhanced.
Considering that compression artifact reduction is a severely under-determined inverse problem, we make two hypotheses in this study. The temporal correlation hypothesis assumes that for every patch in the current decoded frame, there exist several similar patches with more image details along the motion trajectory in neighboring frames; the sparsity prior assumes that natural image patches can be represented by sparse activations of dictionary atoms. Video compression artifact reduction is implemented by fusing the predictions obtained under these hypotheses, and proceeds in two stages: 1) texture detail restoration, and 2) deep CNN fusion. The first stage performs texture detail restoration in a patch-by-patch manner. For each patch in the current decoded frame, one prediction is estimated based on the sparsity prior, and one or more others are predicted based on the temporal correlation hypothesis. The second stage uses a deep CNN to fuse the resulting predictions from the preceding stage.
The main contribution of this work is that it proposes a new method to address the challenge of restoring high frequency contents lost in quantization. The two hypotheses are utilized to construct multiple predictions. These predicted frames typically contain many regions that have more high frequency contents than the current decoded frame. As input of the deep CNN, these frames not only facilitate the network training, but also allow output of improved visual quality.
The remainder of this article is organized as follows. Section II reviews some previous studies related to compression artifact reduction. Section III describes our method for enhancing the visual quality of decoded videos. Experimental results are given in Section IV. Finally, Section V concludes our study.

II. RELATED WORK
As a consequence of the integer transform and quantization in HEVC, the quantization error of a transform coefficient affects the pixels in the same block. In the spatial domain, the quantization error at a given pixel is the sum of the errors of the transform coefficients, each multiplied by the corresponding 2-D DCT basis image [27]. The quantization process can also be regarded as a low-pass filter, which means that a large amount of the high-frequency components of the block is removed in the process. In view of this, we regard the reduction of compression artifacts as image detail restoration, and review the related work in two parts: 1) restoration oriented methods, and 2) deep learning methods for compression artifact reduction.
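The spatial-domain statement above can be checked numerically. The following is a minimal, dependency-free sketch that assumes a 1-D orthonormal 8-point DCT-II and a uniform quantizer; the signal values, the step size of 16 and all function names are illustrative assumptions and do not reproduce the integer transform or quantizer of HEVC.

```python
import math

N = 8  # transform length (illustrative; HEVC uses 2-D integer transforms)


def dct_basis(u):
    """Return the u-th orthonormal DCT-II basis vector of length N."""
    scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return [scale * math.cos(math.pi * (2 * n + 1) * u / (2 * N)) for n in range(N)]


BASIS = [dct_basis(u) for u in range(N)]


def forward(x):
    """Project the signal onto every basis vector."""
    return [sum(b * v for b, v in zip(BASIS[u], x)) for u in range(N)]


def inverse(c):
    """Weighted sum of basis vectors (the transform is orthonormal)."""
    return [sum(c[u] * BASIS[u][n] for u in range(N)) for n in range(N)]


def quantize(c, step):
    """Uniform quantization followed by de-quantization."""
    return [step * round(v / step) for v in c]


x = [52, 55, 61, 66, 70, 61, 64, 73]           # toy pixel row
coeffs = forward(x)
q = quantize(coeffs, 16.0)                      # assumed step size
x_rec = inverse(q)

coef_err = [qc - c for qc, c in zip(q, coeffs)]
pixel_err = [r - v for r, v in zip(x_rec, x)]
# Pixel-domain error rebuilt as coefficient errors times basis vectors.
pixel_err_from_basis = inverse(coef_err)
```

Because the transform is linear, `pixel_err` and `pixel_err_from_basis` agree up to floating-point rounding, which is the 1-D analogue of the sum over 2-D basis images.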

A. RESTORATION ORIENTED COMPRESSION ARTIFACT REDUCTION
Many restoration oriented methods have been proposed for compression artifact reduction. Some works rely on information in the DCT domain. Foi et al. [28] proposed to perform hard-thresholding and empirical Wiener filtering in the shape-adaptive DCT domain, and to utilize clipped or attenuated DCT coefficients to reconstruct a local estimate of the image signal within an adaptive shape support. Zhang et al. [29] proposed to estimate the DCT coefficients of each block by adaptively fusing two predictions: the coefficients decoded from the compressed bitstream and the weighted average of the coefficients in nonlocal blocks. Dar et al. [30] considered a linear approximation of the nonlinear compression-decompression process and formulated compression artifact reduction as a regularized inverse problem, which was solved by the alternating direction method of multipliers [31]. Zhang et al. [32] utilized both the spatial and temporal correlation to form three predictions. The first prediction was constructed by inversely quantizing transform coefficients directly; the second was derived by representing each transform block with a temporal autoregressive model along its motion trajectory; and the third was inferred with the original coefficients from similar blocks in non-local regions. Li et al. [33] focused on super-resolution (SR) of compressed images. After analyzing the correlation between the deblocking and SR sub-processes, they proposed an iterative cascading framework, in which a feedback route was constructed to send extra information extracted from the SR outcome back to the deblocking module, thus helping to produce further performance gains.
Recently, learning-based methods, particularly those based on sparse representation, are gradually becoming a better choice. For solving image inverse problems such as compression artifact reduction, single image super-resolution and image denoising, sparse representation models an image $x$ as a linear system
$$x = \Phi \alpha, \tag{1}$$
where matrix $\Phi \in \mathbb{R}^{n \times K}$ and vector $\alpha \in \mathbb{R}^{K}$. Each column of $\Phi$ serves as an atomic image, and thus $\Phi$ itself is a dictionary of atoms. Vector $\alpha$ is known to be sparse, with only $k \ll K$ non-zero elements. Note that $\alpha$ is a representation of $x$ under $\Phi$ and describes which atoms and what "portions" thereof are used for its construction [34]. Learning dictionaries from example images, which are needed to map a degraded patch to a visually more appealing one, is essential for sparse representation. Jung et al. [5] proposed to iteratively train a general dictionary for image deblocking by the K-SVD algorithm, and to estimate an error threshold in the reconstruction stage for orthogonal matching pursuit according to the compression factor of the compressed image. Works in [6], [10] introduced the notion of group-based sparse representation (GSR) by explicitly exploiting the intrinsic local sparsity and nonlocal self-similarity of natural images. The GSR searches for similar patches in an image and organizes them as a group, which is further assumed to be represented by a few atoms of a self-adaptive dictionary. The reconstruction is then obtained as the solution of an $\ell_0$ minimization, which is solved via its $\ell_1$ surrogate and the split Bregman based iterative algorithm. In [7], an image was first decomposed into low-frequency (LF) and high-frequency (HF) parts. The HF part was further decomposed into blocking and non-blocking components. A dictionary was learned using training samples extracted from the HF parts of the image, and was divided into two sub-dictionaries, corresponding to blocking components and non-blocking components. Liu et al. [9] learned two dictionaries of PCA bases, one in the DCT domain and the other in the pixel domain. In restoration, two locally adaptive sparse representations that jointly determined the restored patch were generated using these two dictionaries.
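To make the sparse model concrete, the sketch below codes a toy signal over a small over-complete dictionary with plain matching pursuit. Matching pursuit is used here only because it is short; it is a simpler relative of the orthogonal matching pursuit mentioned above, and the dictionary, the signal and the function names are illustrative assumptions.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def matching_pursuit(x, atoms, k):
    """Greedy k-term approximation of x over unit-norm atoms.

    Each atom plays the role of one dictionary column; the returned list
    is the sparse coefficient vector alpha.
    """
    residual = list(x)
    alpha = [0.0] * len(atoms)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        scores = [dot(residual, a) for a in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        alpha[j] += scores[j]
        residual = [r - scores[j] * a for r, a in zip(residual, atoms[j])]
    return alpha


def synthesize(atoms, alpha):
    """Rebuild the signal as the weighted sum of atoms."""
    n = len(atoms[0])
    return [sum(alpha[j] * atoms[j][i] for j in range(len(atoms))) for i in range(n)]


h = 0.5
# Toy dictionary with K = 6 unit-norm atoms in R^4 (over-complete).
atoms = [
    [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
    [h, h, h, h], [h, -h, h, -h],
]
x = [2.5, 0.5, 2.5, 0.5]          # built as 3 * atoms[4] + 2 * atoms[5]
alpha = matching_pursuit(x, atoms, k=2)
x_hat = synthesize(atoms, alpha)
```

In this toy case the two active atoms are orthogonal, so two greedy steps recover the signal; with correlated atoms, plain matching pursuit generally needs more iterations or an orthogonalized variant.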

B. DEEP LEARNING FOR COMPRESSION ARTIFACT REDUCTION
More recently, CNNs have been applied to compression artifact reduction following their success in many other computer vision tasks. Dong et al. [19] proposed an artifact reduction CNN (AR-CNN), which was constructed by adding a feature enhancement layer to their previously developed super-resolution CNN (SRCNN), and was reported to achieve more than 1 dB PSNR (peak signal-to-noise ratio) improvement over original JPEG images. Cavigelli et al. [20] proposed a deep CNN for image compression artifact suppression (CAS-CNN). It is a 12-layer CNN with hierarchical skip connections and was trained with a multi-scale loss function. CAS-CNN was reported to allow a boost of up to 1.79 dB in PSNR over JPEG images. Guo et al. [21] proposed a one-to-many network for compression artifact reduction. Their model consists of two components: the proposal component and the measurement component. Taking a JPEG compressed image as input, the former outputs a number of artifact-free candidates, and the latter evaluates the output quality using three loss functions. It was reported that their one-to-many network could reconstruct multiple artifact-free candidates that were more favored by humans. Dai et al. [22] proposed to replace deblocking and SAO in HEVC intra coding with a variable-filter-size residual learning CNN (VRCNN). The VRCNN can reportedly provide a 4.6% BD-rate reduction on average against the HEVC reference implementation. Soh et al. [23] proposed a deep artifact reduction temporal network (ARTN) consisting of three temporal branches. One branch takes the current decoded frame, and the other two take the motion compensated frames as input. The outputs of the three branches are then concatenated and fed to a single network. It was reported that this ARTN achieved about 0.23 dB PSNR improvement over the conventional networks for HEVC artifact reduction. Guan et al. [24] observed that there typically exists large quality variation across consecutive frames of a compressed video. Therefore, it is possible to improve the quality of a low-quality frame with the help of its neighboring high-quality frames, referred to as Peak Quality Frames (PQFs). To this end, they proposed a bidirectional long short-term memory (BiLSTM) model for no-reference PQF detection and a multi-frame CNN architecture for non-PQF quality enhancement. Galteri et al. [25] proposed an image transformation approach based on a feed-forward fully convolutional residual network. Their model can be optimized either directly in terms of an image similarity loss or using a generative adversarial network (GAN), and they showed via experiments that the GAN is capable of producing higher quality images with sharp details.

III. THE PROPOSED TWO-STAGE METHOD
As shown in Figure 1, the proposed artifact reduction method proceeds in two stages, namely, 1) a texture detail restoration stage, followed by 2) a deep CNN fusion stage. The first performs image detail restoration in a patch-by-patch manner, whereas the second uses a deep CNN to fuse the predictions of the preceding stage.
In applying a deep CNN to regression problems such as compression artifact reduction and single image super-resolution, higher-quality inputs not only improve the probability that the model predicts better outputs, but also facilitate the network training. Recognizing the significant role of the input in the performance of the deep CNN, we propose the two-stage method for compression artifact reduction. The first stage operates in a patch-by-patch manner and aims to hallucinate the missing details of the patches. Based on the temporal correlation hypothesis, we search for matching patches in reference frames for a given patch in the current frame. Patches with more texture details are selected to tile motion compensated predictions. Based on the sparsity prior, we learn dictionary pairs representing the mapping between HEVC-compressed patches and their corresponding ground-truths. In the second stage, a deep CNN, which takes a degraded frame and the multi-hypothesis based predictions of the first stage as the input, is constructed to estimate the artifact-free image.

A. PATCH PREDICTION USING TEMPORAL CORRELATION PRIOR
The first hypothesis for image detail restoration states that for a patch in the current decoded frame, there is a high probability that several similar patches with more texture details can be found along its motion trajectory in neighboring frames. To validate this hypothesis, we code several test sequences at three quantization parameter (QP) values, i.e., QP = 42, 37, 32. For each 16 × 16 patch in the decoded frame $I_k$, we calculate the mean squared error (MSE) between the patch and its uncompressed counterpart, and denote it as $\mathrm{MSE}_d$. Using $I_{k \pm i}$, $i = 1, 2, 3, 4$, as reference frames, we search for the best matching patch in each reference. Then, the MSE between each matching patch and the original uncompressed one, denoted as $\mathrm{MSE}_m$, is also calculated. For each patch, we count the number of matching patches whose $\mathrm{MSE}_m$ is smaller than $\mathrm{MSE}_d$. The results for the test sequences are given in Table 1, where $r_1$ is the percentage of patches that have at least one matching patch with smaller $\mathrm{MSE}_m$, and $r_2$ is the percentage of patches that have at least two such matching patches.
The results in Table 1 show that, on average over all test sequences, about 67% of the patches have at least one matching patch with higher quality in neighboring frames, and about 47% have at least two higher-quality patches when 8 reference frames are used. Understandably, constructing motion-compensated frames from these higher-quality patches and using such frames as input to the deep CNN in the fusion stage could be helpful not only in reducing the training difficulty, but also in improving the visual quality of the predictions of the network.
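The counting procedure behind the two percentages can be sketched as follows. This is a minimal illustration that assumes patches are given as flat lists of pixel values and that the best match from each reference frame has already been found; the function names and the toy data are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized patches (flat lists)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def better_match_ratios(originals, decoded, matches):
    """Return (r1, r2) in percent.

    originals[p] / decoded[p]: uncompressed and decoded versions of patch p.
    matches[p]: best matching patches for p, one per reference frame.
    """
    at_least_one = 0
    at_least_two = 0
    for orig, dec, candidates in zip(originals, decoded, matches):
        mse_d = mse(dec, orig)
        # Count matches that are closer to the original than the decoded patch.
        better = sum(1 for m in candidates if mse(m, orig) < mse_d)
        if better >= 1:
            at_least_one += 1
        if better >= 2:
            at_least_two += 1
    total = len(originals)
    return 100.0 * at_least_one / total, 100.0 * at_least_two / total


# Toy data: four 2-pixel patches, each decoded with an MSE of 4.
originals = [[10, 10]] * 4
decoded = [[12, 12]] * 4
matches = [
    [[10, 11], [11, 11]],   # two better matches
    [[11, 10], [14, 14]],   # one better match
    [[13, 13]],             # none
    [[12, 12], [10, 10]],   # one better match (a tie does not count)
]
r1, r2 = better_match_ratios(originals, decoded, matches)
```

For the toy data, three of the four patches have at least one better match and one patch has at least two.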
Because the original frame (i.e., the reference signal against which the decoded frame is to be compared) is not available at the decoder side, we are confronted with a no-reference image quality assessment problem. That is, in the absence of the original uncompressed image, how can we select patches with lower distortion and then use these patches to predict motion compensated frames? The strategy in this study relies on the default coding setting of the HEVC reference software, in which an intra-picture is coded at a relatively high quality. Therefore, the patches located in intra-coded frames are directly accepted as candidates for compensation; for matching patches in inter-coded frames, the approach in [35] is employed to estimate quality scores, and patches with higher scores are selected for compensation. Specifically, assume that we construct $L$ predictions using the temporal correlation hypothesis (we tested $L = 1, 2, 3$ in our experiments). For each patch in the current decoded frame, $L$ patches are selected according to the strategy above, and the $i$th patch is tiled into the $i$th predicted frame at the corresponding position.
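The selection strategy can be sketched as below. The candidate record layout, the ordering among intra-coded candidates and the padding with the current decoded patch when fewer than $L$ candidates exist are assumptions made for illustration; the no-reference quality estimator of [35] is not implemented, and its scores are taken as given inputs.

```python
def select_candidates(candidates, num_predictions, fallback):
    """Pick patches for the motion compensated predictions.

    candidates: list of dicts with keys 'patch', 'is_intra' and 'score',
                where 'score' is an externally estimated quality score
                (higher is better; only used for inter-coded candidates).
    fallback:   the current decoded patch, used when candidates run out
                (an assumption of this sketch).
    """
    # Patches from intra-coded frames are accepted directly.
    intra = [c for c in candidates if c["is_intra"]]
    # Patches from inter-coded frames are ranked by estimated quality.
    inter = sorted(
        (c for c in candidates if not c["is_intra"]),
        key=lambda c: c["score"],
        reverse=True,
    )
    chosen = [c["patch"] for c in (intra + inter)[:num_predictions]]
    chosen += [fallback] * (num_predictions - len(chosen))
    return chosen


candidates = [
    {"patch": "a", "is_intra": False, "score": 0.3},
    {"patch": "b", "is_intra": True, "score": 0.0},
    {"patch": "c", "is_intra": False, "score": 0.9},
]
top_two = select_candidates(candidates, 2, "current")
```

The $i$th element of the returned list would then be tiled into the $i$th predicted frame.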

B. PATCH PREDICTION USING SPARSITY PRIOR
The second hypothesis utilized in our approach is the sparsity prior, which assumes that most patches in a natural image can be well coded by structural primitives, e.g., edges and textures, and can be represented by a small number of basis functions chosen out of an over-complete code set.
An external database is utilized to learn a one-to-one mapping between HEVC-compressed patches and their corresponding ground-truths. We partition the samples in the database into $M$ clusters using the k-means clustering algorithm, so that the samples in a cluster are similar to each other. Then, dictionaries are learned from each of the $M$ clusters. Mathematically, let $P = \{X, Y\}$ denote the training sets corresponding to a cluster, where $X$ and $Y$ are sets containing quantization-degraded and uncompressed image patches, respectively. For each $\sqrt{n} \times \sqrt{n}$ image patch $x_p \in X$, which is cropped from the decoded image at location $p$, its undegraded counterpart $y_p \in Y$ is extracted from the original uncompressed image at the same position.
Given a cluster $P$, we aim to learn dictionaries from the sets $X$ and $Y$. The sparse representation (i.e., the coding vector $\alpha$ in (1)) of any uncompressed patch in $Y$ is the same as that of its degraded counterpart in $X$. Following the approach in [36], the learning can be formulated as the optimization problem in (2) and (3), where $\lambda$ is a parameter that balances the sparsity of the solution and the error term; $\Phi_o$ and $\Phi_d$ are the dictionaries corresponding to uncompressed and degraded image patches, respectively; $\Lambda = [\alpha_1, \ldots, \alpha_N]$; and each column vector in this matrix satisfies $x_i \approx \Phi_d \alpha_i$ and $y_i \approx \Phi_o \alpha_i$.
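As a hedged illustration only, one common way to couple two dictionaries through shared sparse codes is to learn the degraded-patch dictionary and the codes with an $\ell_1$-regularized fit, and then to fit the uncompressed-patch dictionary by least squares on the same codes. The specific form below is an assumption made for this sketch and may differ from the formulation in [36]; here $X$ and $Y$ are treated as matrices whose columns are the patches, and the closed form in the second line holds when $\Lambda \Lambda^{\top}$ is invertible.

```latex
\begin{aligned}
\{\Phi_d, \Lambda\} &= \arg\min_{\Phi_d,\,\Lambda} \; \|X - \Phi_d \Lambda\|_F^2 + \lambda \|\Lambda\|_1, \\
\Phi_o &= \arg\min_{\Phi_o} \; \|Y - \Phi_o \Lambda\|_F^2
        = Y \Lambda^{\top} \left(\Lambda \Lambda^{\top}\right)^{-1}.
\end{aligned}
```

Under this assumed form, the first problem enforces $x_i \approx \Phi_d \alpha_i$ with sparse $\alpha_i$, and the second enforces $y_i \approx \Phi_o \alpha_i$ with the same codes.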
Since there always exist fine texture regions in images, learning only one pair of dictionaries is often insufficient to cover all variations of the patches in a cluster. Therefore, in our implementation, we learn multiple dictionary pairs for a cluster, and organize these pairs as a decision tree. Each tree consists of leaf (terminal) and internal (non-terminal) nodes. A leaf node stores a pair of dictionaries, which is the optimal solution for estimating the artifact-free patch using a compression-degraded patch as input. An internal node serves as a split function, which attempts to efficiently select an appropriate dictionary pair for an input sample. For a cluster, the leaf nodes are recursively learned using the corresponding sample set. The maximum depth and the minimum number of samples arriving at a node are chosen as the stopping criteria, and the growing of a tree stops if either one is met.
The task in learning an internal node is to find a split function that can correctly partition training samples for each leaf node. There are several choices for this purpose. In this article, we employ Naive Bayes as the weak classifier in internal nodes. Let $\Phi^k_o$ and $\Phi^k_d$ denote the dictionary pair in the $k$th leaf node. The class label $c_i$ for $(x_i, y_i)$ is determined by (4), which compares the error between $y_i$ and $\hat{y}_i$ with a threshold, where $\hat{y}_i$ is the patch reconstructed on $\Phi^k_o$ with the sparse coefficients obtained by coding $x_i$ over $\Phi^k_d$, and $\delta_T$ is the threshold. Then, we form the training set $\{(x_i, c_i)\}_{i=1}^{N}$ and train a Naive Bayes classifier as the split function for the $k$th leaf.
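The labeling and split-function training can be sketched as follows. The sketch assumes that a sample receives label 1 when the mean squared error between its ground truth and the leaf's estimate does not exceed the threshold, and that the Naive Bayes classifier uses Gaussian likelihoods; both choices, as well as all names and the toy data, are assumptions made for illustration.

```python
import math


def assign_labels(targets, estimates, threshold):
    """Label 1 if the leaf reconstructs the sample well enough, else 0."""
    labels = []
    for y, y_hat in zip(targets, estimates):
        err = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
        labels.append(1 if err <= threshold else 0)
    return labels


def fit_gaussian_nb(samples, labels, eps=1e-6):
    """Return {label: (log_prior, means, variances)} for Gaussian Naive Bayes."""
    model = {}
    for c in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == c]
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
        # eps keeps the variance strictly positive for degenerate features.
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / len(rows) + eps
            for d in range(dims)
        ]
        model[c] = (math.log(len(rows) / len(samples)), means, variances)
    return model


def predict(model, x):
    """Return the label with the highest log-posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
    return max(model, key=lambda c: log_posterior(model[c]))


labels = assign_labels([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.1], [2.0, 2.0]], 0.05)
degraded = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
split = fit_gaussian_nb(degraded, [1, 1, 0, 0])
```

At test time, `predict(split, x)` would indicate whether the degraded patch `x` should be handled by this leaf's dictionary pair.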
The scheme for learning all leaves and the corresponding split functions in a tree is summarized in Algorithm 1.

Algorithm 1 Decision tree learning for a cluster
Input: Data set P for a cluster
Output: The learned decision tree
% Learning dictionaries in leaf nodes;
while the criteria for stopping the growing are not satisfied do
    Randomly select N sample pairs {(x_i, y_i)}, i = 1, ..., N, from P;
    Learn a dictionary pair from these training samples by solving (2) and (3), and store the pair as a leaf;
    Q ← ∅;
    for each sample pair (x_i, y_i) in P do
        Estimate the artifact-free patch ŷ_i;
        if ŷ_i is sufficiently close to y_i then
            Skip (x_i, y_i);
        else
            Q ← Q ∪ {(x_i, y_i)};
        end
    end
    P ← Q;
end
% Learning split functions;
for each leaf node do
    Assign a class label to each sample in P using (4);
    Form the training set {(x_i, c_i)};
    Train a Naive Bayes classifier as the split function;
end
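The leaf-growing loop of Algorithm 1 can be sketched in code as follows. The dictionary learning and reconstruction steps are passed in as callables and are not implemented; the interpretation that well-reconstructed samples are dropped before the next leaf is learned, and the use of a maximum number of leaves in place of the maximum depth, are assumptions of this sketch.

```python
import random


def grow_leaves(pairs, learn_pair, reconstruct, error,
                n_train, tol, max_leaves, min_samples, seed=0):
    """Greedily learn leaves until a stopping criterion is met.

    pairs:       list of (degraded, original) samples for one cluster.
    learn_pair:  callable(batch) -> leaf model (hypothetical stand-in for
                 dictionary-pair learning).
    reconstruct: callable(leaf, degraded) -> estimate of the original.
    error:       callable(estimate, original) -> non-negative error.
    """
    rng = random.Random(seed)
    leaves = []
    remaining = list(pairs)
    while len(leaves) < max_leaves and len(remaining) >= min_samples:
        batch = rng.sample(remaining, min(n_train, len(remaining)))
        leaf = learn_pair(batch)
        leaves.append(leaf)
        # Keep only the samples this leaf does not yet reconstruct well.
        remaining = [
            (x, y) for x, y in remaining
            if error(reconstruct(leaf, x), y) > tol
        ]
    return leaves


# Toy stand-in: a "leaf" is a scalar gain g with y ~ g * x.
toy_pairs = [(1, 2), (2, 4), (4, 8), (1, 3), (2, 6), (4, 12)]
toy_leaves = grow_leaves(
    toy_pairs,
    learn_pair=lambda batch: batch[0][1] / batch[0][0],
    reconstruct=lambda g, x: g * x,
    error=lambda est, y: abs(est - y),
    n_train=2, tol=1e-9, max_leaves=5, min_samples=1,
)
```

In the toy run, the first leaf explains one group of samples exactly, the second leaf is learned from the remaining group, and the loop then stops because no samples are left.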

C. VISUAL QUALITY ENHANCEMENT USING A DEEP CONVOLUTIONAL NEURAL NETWORK
The first stage in our method works in a sliding-window style. Based on the sparsity prior, we estimate a prediction for each patch in the current decoded frame, and the resulting patches are tiled into a reconstructed image. Based on the temporal correlation hypothesis, we search for the best matching patch in each reference frame; several matching patches are then selected and used to tile motion compensated predictions. The first stage can produce images with more texture details. However, it also creates small artifacts across an image, since visible artifacts may occur on block boundaries. To suppress these artifacts, the second stage of our method uses a deep CNN that further enhances the visual quality of the images.
For restoring a given decoded frame, the degraded image itself, the sparse coding based reconstruction, and several motion compensated frames are stacked as a tensor. The first convolutional layer in the deep CNN (i.e., Conv_IN in Figure 1) takes the tensor as the input to extract feature maps.
At the core of our deep CNN are many stacked residual units with an identical layout. Inspired by [15], [37] for deblurring and super-resolution, we do not use batch normalization layers, and we remove the rectified linear unit that follows the shortcut connection of the original building block in [38]. As shown in Figure 1, the residual building block in our model uses two convolutional layers. The first layer (e.g., Conv_11) uses 3 × 3 × C kernels and generates 2C feature maps, where C is a constant, whereas the second one (e.g., Conv_12) uses 3 × 3 × 2C kernels and generates C feature maps. After the first convolutional layer, a parametric rectified linear unit (PReLU), whose slope for negative inputs is learned from data rather than pre-defined, is used as the activation function. Let $z_i$ and $z_{i+1}$ denote the input and output of the $i$th residual unit, respectively. The nonlinear mapping of the unit can be formulated as follows [38], [39]:
$$z_{i+1} = z_i + F(z_i, W_i), \tag{5}$$
where $F$ represents the residual function and $W_i$ denotes the weights of the $i$th unit. The reconstruction module in our network includes three convolutional layers (i.e., Conv_F1, Conv_F2 and Conv_F3). As illustrated in Figure 1, there are two inputs to this module, i.e., the input $z_1$ of the first residual unit and the output $z_{L+1}$ of the last unit. For each input, 16 filters of size 3 × 3 × C are used to generate feature maps, which are then fused via an element-wise sum. Finally, the last convolutional layer outputs the reconstructed image corresponding to the input tensor using a 3 × 3 × 16 filter.
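The data flow of one residual unit (first layer, PReLU, second layer, identity shortcut with no activation after the sum) can be sketched on plain vectors. The two layers are passed in as arbitrary callables standing in for Conv_11 and Conv_12; they and the toy values are assumptions for illustration, not the actual convolutions.

```python
def prelu(values, slope):
    """Parametric ReLU: identity for non-negative inputs, slope * x otherwise."""
    return [v if v >= 0 else slope * v for v in values]


def residual_unit(z, layer1, layer2, slope):
    """Compute z + F(z), where F = layer2(PReLU(layer1(z))).

    No activation is applied after the addition, matching the modified
    building block described above.
    """
    f = layer2(prelu(layer1(z), slope))
    return [a + b for a, b in zip(z, f)]


# Toy stand-ins for the two convolutional layers.
negate = lambda v: [-x for x in v]
identity = lambda v: list(v)

out = residual_unit([1.0, -2.0], negate, identity, slope=0.1)
```

In an actual network the slope would be a trainable parameter and the two callables would be 3 × 3 convolutions with 2C and C output feature maps.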
Assume that the number of the input channels is 3. A typical parameter configuration for the proposed deep CNN is listed in Table 2, where the total number of residual units is chosen to be 8, and the constant C = 24.
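As a rough size check, the number of convolution weights implied by this configuration can be tallied as below. The sketch assumes that Conv_IN uses 3 × 3 kernels and produces C feature maps, that the output has a single channel, and that biases and PReLU slopes are excluded; these details are not stated in the text, so the total is indicative only.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * in_channels * out_channels


def count_weights(in_channels=3, C=24, units=8, fuse_channels=16, kernel=3):
    """Tally convolution weights for the described layout (assumptions above)."""
    conv_in = conv_weights(kernel, in_channels, C)
    per_unit = conv_weights(kernel, C, 2 * C) + conv_weights(kernel, 2 * C, C)
    # Conv_F1 and Conv_F2 (16 filters each) plus the final 3 x 3 x 16 filter.
    reconstruction = (2 * conv_weights(kernel, C, fuse_channels)
                      + conv_weights(kernel, fuse_channels, 1))
    return conv_in + units * per_unit + reconstruction


total_weights = count_weights()
```

Under these assumptions the residual units dominate the total, since each unit contributes 2 × 9 × C × 2C weights.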

IV. EXPERIMENTS

A. DATASETS AND TRAINING
For learning the dictionaries to be used for sparse reconstruction and training the deep CNN in our method, we have built a large-scale dataset derived from standard test sequences recommended by JCT-VC with four resolutions (Class B, C, D, and E). For each sequence, the corresponding HEVC bitstream is coded by HM 16.0 [40] at three QP values (i.e., 42, 37, 32) using encoder_lowdelay_P_main.cfg as the configuration file. For each QP value, a dataset containing several hundred image pairs is constructed. Let it be denoted as $S = \{(I_k, \hat{I}_k)\}_{k=1}^{N_S}$, where $I_k$ is a frame selected from a test sequence, $\hat{I}_k$ is the corresponding decoded frame, and $N_S$ is the total number of frame pairs in the dataset. Only the first 30 frames of a sequence are included in the training set; the frames thereafter serve as the test set.
For constructing the temporal correlation based predictions, a block matching algorithm, which estimates motion on the basis of rectangular blocks and generates one motion vector for each block [41]-[43], is employed to obtain a straightforward and efficient implementation. Although the full search method, which checks all candidate patches to find the best match within a particular window, could find the best motion vectors in a global sense, its computational requirement is often too high for practical implementation. Therefore, we use an improvement of the successive elimination algorithm (SEA) [41], which excludes many search positions and still allows accuracy comparable to that of the full search. In our experiments, the patch size is fixed to 16 × 16, and the search range is determined in relation to the interval between the current frame and the reference frame. To cope with abrupt scene changes, when the matching error exceeds a certain threshold, as in [23], the matching patches are discarded and replaced by the patch in the current frame.
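The elimination idea can be sketched as follows. This is the basic successive elimination test rather than the improved variant used in our experiments: since the sum of absolute differences (SAD) between two blocks is never smaller than the absolute difference of their pixel sums, a candidate whose sum difference already reaches the best SAD found so far can be skipped. Frames are plain nested lists, block sums are recomputed for clarity instead of being cached, and all names and the synthetic frames are illustrative assumptions.

```python
def block_sum(frame, top, left, size):
    return sum(sum(row[left:left + size]) for row in frame[top:top + size])


def sad(cur, ref, top, left, r_top, r_left, size):
    """Sum of absolute differences between a current and a reference block."""
    return sum(
        abs(cur[top + i][left + j] - ref[r_top + i][r_left + j])
        for i in range(size) for j in range(size)
    )


def sea_search(cur, ref, top, left, size, radius):
    """Return (dy, dx, best_sad) for the block at (top, left) in cur."""
    height, width = len(ref), len(ref[0])
    target_sum = block_sum(cur, top, left, size)
    best = (0, 0, sad(cur, ref, top, left, top, left, size))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r_top, r_left = top + dy, left + dx
            if not (0 <= r_top <= height - size and 0 <= r_left <= width - size):
                continue
            # Lower bound on the SAD; skip if it cannot beat the current best.
            bound = abs(target_sum - block_sum(ref, r_top, r_left, size))
            if bound >= best[2]:
                continue
            cost = sad(cur, ref, top, left, r_top, r_left, size)
            if cost < best[2]:
                best = (dy, dx, cost)
    return best


# Synthetic frames: cur is ref shifted by 1 row and 2 columns (zero padded).
ref = [[(7 * r + 13 * c + r * c) % 31 for c in range(12)] for r in range(12)]
cur = [[ref[r + 1][c + 2] if r + 1 < 12 and c + 2 < 12 else 0
        for c in range(12)] for r in range(12)]
motion = sea_search(cur, ref, top=4, left=4, size=4, radius=3)
```

A practical implementation would precompute the block sums of the reference frame (for example with an integral image) so that the bound costs far less than a full SAD evaluation.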
For learning the decision trees used for sparse coding-based reconstruction, we randomly crop about 500,000 patch pairs from images in $S$ (one from $I_k$ and the other from $\hat{I}_k$ at the same position). The size of the patches is set to 8 × 8. These patch pairs are then partitioned into clusters, and for each cluster, we learn a decision tree that stores dictionary pairs in the leaf nodes and split functions in the internal nodes.
For training our deep CNN, we randomly select a degraded image in the training set and crop a B × B patch (we have experimented with different patch sizes, e.g., B = 96, 128, 160) by randomly selecting the pixel coordinate of the patch. A total of L (L = 1, 2 or 3) patches of the same size are cropped at the same position in the motion compensated frames, and one patch is cropped from the sparse coding-based reconstruction. These L + 1 patches, along with the degraded patch itself, are stacked to form a tensor as the input of the network. The patch at the same location in the corresponding uncompressed image serves as the ground-truth. A training sample is thus formed, and flip-based data augmentation is also adopted during training.
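The construction of one training sample can be sketched as follows. Images are single-channel nested lists; the channel order, the use of a horizontal flip with probability 0.5 and the function names are assumptions made for illustration.

```python
import random


def crop(img, top, left, size):
    """Cut a size x size patch out of a 2-D image stored as nested lists."""
    return [row[left:left + size] for row in img[top:top + size]]


def make_sample(decoded, sparse_pred, mc_preds, original, size, rng):
    """Return (input_tensor, target) cropped at one random position.

    input_tensor is a list of channels: the decoded patch, the sparse
    coding-based prediction and the L motion compensated predictions.
    """
    height, width = len(decoded), len(decoded[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    planes = [decoded, sparse_pred] + list(mc_preds)
    tensor = [crop(p, top, left, size) for p in planes]
    target = crop(original, top, left, size)
    if rng.random() < 0.5:
        # Flip input channels and target together so they stay aligned.
        tensor = [[row[::-1] for row in channel] for channel in tensor]
        target = [row[::-1] for row in target]
    return tensor, target


toy = [[r * 10 + c for c in range(10)] for r in range(8)]
tensor, target = make_sample(toy, toy, [toy], toy, 4, random.Random(0))
```

With L = 1 the input has three channels, which corresponds to the 1M-1S-1D setting.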
The proposed network is implemented in the TensorFlow framework and trained on an NVIDIA GTX 1080 Ti GPU with a mini-batch size of 80 for 128 × 128 patches. The Adam optimizer [44] is adopted to optimize the parameters. The learning rate is initialized to 0.01 and then multiplied by 0.995 after every epoch. The training generally takes about 1000 epochs to converge.
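The learning rate schedule described above is a per-epoch exponential decay, which can be written as a one-line function. The function name is illustrative.

```python
def learning_rate(epoch, base=0.01, decay=0.995):
    """Learning rate used during the given epoch (epoch 0 is the first)."""
    return base * decay ** epoch


schedule = [learning_rate(e) for e in (0, 100, 500, 1000)]
```

With these settings the rate falls by roughly two orders of magnitude over training, to about 6.7 × 10^-5 after 1000 epochs.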

B. RESULTS OF PATCH-BASED RECONSTRUCTION
We now show that the predictions using the temporal correlation and the sparsity prior can indeed yield better details and textures, and thus allow improved artifact reduction in the deep CNN fusion stage. For this purpose, we present part of the results of the predictions using these two hypotheses at QP = 42 and 37 in Figures 2 and 3, respectively. The first column in each figure shows the original frames, and the second shows the degraded images compressed by HEVC. The predictions derived from temporal correlation and those constructed via sparse representation are shown in columns 3 and 4, respectively. From the results, we observe that the predictions (especially the sparse coding-based reconstructions) restore some local texture details. However, some noise also appears to be introduced. To make the comparison quantitative, we also use MSE as the quality metric. Let $x$ denote a decoded patch, and let $y$ and $\hat{y}$ denote the uncompressed original and the predicted patch corresponding to $x$, respectively. We calculate and compare two types of MSE: 1) $\mathrm{MSE}_d$ between $y$ and $x$, and 2) $\mathrm{MSE}_r$ between $y$ and $\hat{y}$. For each 8 × 8 patch in the current frame, we count the number of reconstructed patches whose $\mathrm{MSE}_r$ is smaller than $\mathrm{MSE}_d$. The patches that have at least one reconstructed patch with smaller $\mathrm{MSE}_r$ are labeled as red regions (other regions are shown in blue), and the results are shown in the fifth columns of Figures 2 and 3 (denoted as $r_1$ labeled). The patches that have at least two predictions with smaller $\mathrm{MSE}_r$ are labeled as red regions and shown in the sixth columns (denoted as $r_2$ labeled) of the two figures.
From both figures, we can see that the vast majority of patches have at least one reconstructed patch with higher quality, and about half of them have at least two higher-quality predictions. The results confirm that the two proposed hypotheses can predict images with better details. Figures 4 and 5 show the subjective quality performance on some test sequences at QP = 42 and 37, respectively. By magnifying the images, we can see that our method reconstructs more visually pleasing results, with sharper edges and more texture details, than the other methods under comparison, namely, AR-CNN [19], ARTN [23], and MFQE2.0 [24]. The implementations of these methods are all based on their released source code, and our code is also available at https://github.com/wgchen-gsu/VCAR-CNN.

C. RESULTS OF DEEP CONVOLUTIONAL NEURAL NETWORK
The quality enhancement is also objectively evaluated in terms of ΔPSNR and ΔSSIM, which measure the PSNR and SSIM (structural similarity index metric) gains of the enhanced over the original decoded sequences, respectively. Table 3 tabulates PSNR results on some test sequences, with the best results highlighted in boldface type. From the results, we can see that the proposed method achieves highly competitive performance compared with other leading compression artifact reduction methods. The performance gain is mainly attributable to the first stage of our method. The sparsity prior and the temporal correlation hypothesis can predict patches with more texture details and can therefore address the loss of high frequency contents induced by quantization in video compression.
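The PSNR gain of an enhanced frame over its decoded version can be computed as in the following sketch, which assumes 8-bit samples (peak value 255) stored as flat lists; the function names and toy values are illustrative.

```python
import math


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical inputs)."""
    err = mse(reference, test)
    return float("inf") if err == 0 else 10.0 * math.log10(peak ** 2 / err)


def delta_psnr(original, decoded, enhanced, peak=255.0):
    """PSNR gain of the enhanced frame over the decoded frame, in dB."""
    return psnr(original, enhanced, peak) - psnr(original, decoded, peak)


original = [100, 100, 100, 100]
decoded = [104, 104, 104, 104]    # MSE of 16 against the original
enhanced = [102, 102, 102, 102]   # MSE of 4 against the original
gain = delta_psnr(original, decoded, enhanced)
```

Reducing the MSE by a factor of four, as in the toy data, corresponds to a gain of 10 log10(4), or about 6.02 dB.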
Following the approach of [45], we also evaluate our method using the perception index (PI), a no-reference metric that measures the perceptual quality of a reconstructed image and is calculated as a combination of the no-reference image quality measure Ma_score and the naturalness image quality evaluator (NIQE) value [46]:
$$\mathrm{PI} = \frac{1}{2}\bigl((10 - \mathrm{Ma\_score}) + \mathrm{NIQE}\bigr). \tag{6}$$
Recall that a lower PI value indicates better perceptual quality. To jointly quantify the accuracy (in terms of the root mean square error (RMSE)) and the perceptual quality in (6), we compare ARTN, MFQE2.0 and our method on the perception-distortion plane. For every sequence listed in Table 3, we compute the average PI and RMSE values over all frames of the test sequence and mark a dot on the perception-distortion plane. The results in Figure 6 also demonstrate that our method is admissible among the group of tested methods.
In our framework, various combinations of the predictions resulting from the temporal correlation hypothesis and the sparsity prior can be used as input. To evaluate the impact of such combinations, we conducted experiments on several different choices. Table 5 tabulates the average PSNR and SSIM results for three combinations on the same test sequences that are considered in Table 3, where xM-1S-1D indicates that x motion compensated frames (x = 1, 2, 3) and one sparse coding-based prediction, along with the decoded frame itself, are used as the input. For all these combinations, 8 residual units are stacked as the core of our deep CNN model.
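The perception index and the per-sequence point on the perception-distortion plane can be computed as in the sketch below. The Ma_score, NIQE and RMSE values are assumed to come from external tools and are passed in as plain numbers; the function names and the toy values are illustrative.

```python
def perception_index(ma_score, niqe):
    """Perception index: lower values indicate better perceptual quality."""
    return 0.5 * ((10.0 - ma_score) + niqe)


def sequence_point(frames):
    """Average per-frame values into one (mean RMSE, mean PI) point.

    frames: list of (ma_score, niqe, rmse) tuples, one per frame.
    """
    pis = [perception_index(ma, niqe) for ma, niqe, _ in frames]
    rmses = [rmse for _, _, rmse in frames]
    return sum(rmses) / len(rmses), sum(pis) / len(pis)


point = sequence_point([(8.0, 4.0, 2.0), (9.0, 5.0, 4.0)])
```

Each method contributes one such point per test sequence, and points closer to the origin are preferable on both axes.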
We note that 1M-1S-1D performs the best in most cases. The results suggest that using more motion compensated predictions may not improve the performance of the model. A possible reason is that, for a given patch, only a limited number of its temporal correlation-based predictions have higher quality than the patch itself, so using more frames formed by tiling lower-quality patches does not help.
We also evaluate the performance while changing the number of residual units in our model. Specifically, we fix the combination of the predictions as 1M-1S-1D and use 6, 8, or 10 residual units in the deep CNN fusion stage. The average PSNR and SSIM values are given in Table 6. The results show that increasing the number of stacked residual units does not necessarily improve the network performance, possibly because training becomes more difficult as the network goes deeper.

D. COMPUTATIONAL COMPLEXITY
We evaluate the running time of the proposed method on a computer equipped with an Intel i9-9900K CPU, 32 GB of memory, and a GeForce GTX 1080 Ti GPU. We record the inference time for video sequences at different resolutions. On average, ARTN [23], MFQE2.0 [24], and our method (1M-1S-1D) take 5.66, 8.12, and 15.07 seconds, respectively, to process a 1920 × 1080 (i.e., Class B) frame, and 2.58, 3.55, and 8.17 seconds, respectively, to process a 1280 × 720 (i.e., Class E) frame. The running time of the proposed method is thus about two to three times that of the other methods under comparison. We would like to point out that the first stage of our method (i.e., patch-by-patch detail restoration based on the temporal correlation hypothesis and sparsity prior) is currently implemented in MATLAB and Python without parallelism. In particular, we find in experiments that the sparsity-based processing typically takes about 75% of the total computation time of the proposed method. In the future, we intend to optimize the first stage with a parallel implementation and run the optimized code on a GPU. We believe that the computation time of the proposed two-stage method could then be significantly reduced.
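The timing figures above can be put in perspective with a short calculation. The sketch below first computes the running-time ratios from the reported numbers, and then gives an Amdahl-style estimate of the total time if only the sparsity-based processing were accelerated. The speedup factor of 10 is purely hypothetical, and the 75% share is the approximate figure quoted above, so the projected times are indicative only and are not measured results.

```python
def projected_time(total_seconds, fraction, speedup):
    """Amdahl-style estimate: only `fraction` of the total time is sped up."""
    if not 0.0 <= fraction <= 1.0 or speedup <= 0:
        raise ValueError("fraction must lie in [0, 1] and speedup must be positive")
    return total_seconds * ((1.0 - fraction) + fraction / speedup)

# Average per-frame running times in seconds, as reported in the text.
timings = {
    "1920x1080": {"ARTN": 5.66, "MFQE2.0": 8.12, "ours": 15.07},
    "1280x720": {"ARTN": 2.58, "MFQE2.0": 3.55, "ours": 8.17},
}

SPARSE_FRACTION = 0.75       # approximate share of the sparsity-based processing
HYPOTHETICAL_SPEEDUP = 10.0  # assumed for illustration only, not a measurement

for resolution, t in timings.items():
    ratios = {name: t["ours"] / t[name] for name in ("ARTN", "MFQE2.0")}
    estimate = projected_time(t["ours"], SPARSE_FRACTION, HYPOTHETICAL_SPEEDUP)
    print(resolution,
          {name: round(r, 2) for name, r in ratios.items()},
          f"projected: {estimate:.2f} s")
```

Under these assumptions, the remaining 25% of the computation bounds the achievable reduction: even an arbitrarily large speedup of the sparsity-based processing alone could not reduce the total time below one quarter of its current value.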

V. CONCLUSION
We have proposed in this article a two-stage method to meet the challenge of restoring high frequency contents lost due to quantization for video compression artifact reduction. The first stage aims at image texture detail restoration and performs in a patch-by-patch manner. Based on the sparsity prior, we estimate a prediction for every patch in the current decoded frame, and tile the resulting patches as a sparse coding-based reconstruction. Based on the temporal correlation hypothesis, we search for the best matching patch in each reference frame, and select a number of matches with more image details to tile motion compensated predictions. In the second stage, we stack a decoded frame and its reconstructed counterparts as a tensor. A deep CNN is proposed to learn the mapping between the input tensor and the original uncompressed image.
The main advantage of the proposed approach is that its first stage typically produces patches with more texture details under the sparsity prior and the temporal correlation hypothesis. Using these high-quality predictions as the input to the deep CNN enables the model to yield an output of high visual quality. Experimental results demonstrate that our proposed method can remarkably improve both the subjective and the objective quality of the compressed video sequences. It should be noted that the multi-hypothesis prediction introduces additional computation. Also note that in the sparse coding-based patch reconstruction, the accuracy and computational efficiency are influenced by the number of clusters and the number of dictionary pairs for a given cluster. One important problem for future investigation is to determine optimal values for the parameters of the sparse coding-based reconstruction.
WEI-GANG CHEN received the Ph.D. degree from the Department of Computer Science and Technology, Shanghai Jiao Tong University, Shanghai, China, in 2004. He is currently a Professor with the School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China. He has authored or coauthored more than 30 research articles in leading journals and conferences. His current interests include video and image processing, video compression and communication, and object detection.
RUNYI YU (Senior Member, IEEE) received the Ph.D. degree in automatic control from Beijing University of Aeronautics and Astronautics, Beijing, China. He is currently a Professor of electrical and electronic engineering with Eastern Mediterranean University, Famagusta, North Cyprus. He has authored or coauthored more than 60 publications in the general areas of systems and control, and signal processing. His current research interests include sampling theory, tensor analysis, and their applications in data processing.
XUN WANG (Member, IEEE) received the B.Sc. degree in mechanics and the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 1990 and 2006, respectively. He is currently a Professor with the School of Computer Science and Information Engineering, Zhejiang Gongshang University, China. In recent years, he has authored more than 80 articles in journals and conferences. He holds nine authorized invention patents. His current research interests include mobile graphics computing, VR/AR, computer vision, and intelligent information processing. He has five provincial and ministerial level scientific and technological progress awards. He is a member of the ACM and a Distinguished Member of the CCF.