Deep Residual Autoencoder for quality independent JPEG restoration

In this paper we propose a deep residual autoencoder exploiting Residual-in-Residual Dense Blocks (RRDB) to remove artifacts in JPEG compressed images independently of the Quality Factor (QF) used. The proposed approach leverages both the learning capacity of deep residual networks and prior knowledge of the JPEG compression pipeline. The proposed model operates in the YCbCr color space and performs JPEG artifact restoration in two phases using two different autoencoders: the first one restores the luma channel exploiting 2D convolutions; the second one, using the restored luma channel as a guide, restores the chroma channels exploiting 3D convolutions. Extensive experimental results on three widely used benchmark datasets (i.e. LIVE1, BSD500, and CLASSIC-5) show that our model is able to outperform the state of the art with respect to all the evaluation metrics considered (i.e. PSNR, PSNR-B, and SSIM). This result is remarkable since the approaches in the state of the art use a different set of weights for each compression quality, while the proposed model uses the same weights for all of them, making it applicable to images in the wild where the QF used for compression is unknown. Furthermore, the proposed model shows greater robustness than state-of-the-art methods when applied to compression qualities not seen during training.


I. INTRODUCTION
Image compression is a very active research topic because of the impact of image data in a large number of fields, from image sharing on the web to more specific applications involving the acquisition of images and their transfer to processing nodes.
Specifically, image compression refers to the task of representing images using the smallest storage space possible.
Compression algorithms play a key role in saving space and bandwidth when storing and transferring large amounts of images. Two different compression paradigms exist: the former is lossless image compression, where the compression rate is limited by the requirement that the original image must be perfectly recovered; the latter, more widespread, is lossy image compression, where higher compression rates are possible at the cost of some distortion in the recovered image. Among the lossy compression algorithms, the most widely used is the JPEG compression algorithm.
The JPEG compression algorithm first converts the original RGB image into YCbCr color space and processes the luma and chroma channels separately. It divides the luma channel of an input image into non-overlapping 8 × 8 blocks and performs the Discrete Cosine Transform (DCT) on each block separately, while downsampling the chroma components with a bilinear filter. The DCT coefficients obtained from the luma channel are then quantized based on quantization tables that are adjusted according to the user-selected quality factor. The image is then reconstructed from the quantized DCT coefficients by using the inverse DCT. The described JPEG encoding operation introduces three kinds of artifacts in the recovered images, whose severity depends on the quality factor used for the compression: i) blocketization artifacts, which come from the recombination of the 8 × 8 blocks, which are compressed independently without considering the adjacent blocks; ii) ringing artifacts, which are most visible along the edges and are related to the coarse quantization of the high-frequency components; iii) blurred low-frequency areas, which are also related to the compression of the high frequencies in the DCT domain. The presence of these kinds of artifacts is a problem, since the overall quality of the images is degraded: the result is unpleasant for normal users in generic applications (e.g. projection, print, etc.), or even useless for computer vision applications where the loss of information can be critical for the task [1], [2].
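The block-level quantization step described above can be illustrated with a minimal sketch. It is not the encoder used in this paper: it assumes the example luminance table from Annex K of the JPEG standard and the quality scaling convention used by the IJG libjpeg implementation, and the function names (`scaled_table`, `jpeg_block_roundtrip`) are hypothetical. It only shows how a lower QF leads to coarser quantization steps and therefore to a larger loss of information in each 8 × 8 block.

```python
import math

# Example luminance quantization table from Annex K of the JPEG standard.
BASE_LUMA_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

# Cosine basis shared by the forward and inverse transforms:
# COS[x][u] = cos((2x + 1) * u * pi / 16)
COS = [[math.cos((2 * x + 1) * u * math.pi / 16) for u in range(8)]
       for x in range(8)]


def scaled_table(qf):
    """Scale the base table for a quality factor in 1..100 (IJG convention, assumed)."""
    qf = max(1, min(100, qf))
    scale = 5000 // qf if qf < 50 else 200 - 2 * qf
    return [[max(1, min(255, (q * scale + 50) // 100)) for q in row]
            for row in BASE_LUMA_TABLE]


def _alpha(k):
    """Normalization factor of the DCT-II basis."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct2(block):
    """Forward 8x8 DCT-II of a block of level-shifted samples."""
    return [[0.25 * _alpha(u) * _alpha(v) *
             sum(block[x][y] * COS[x][u] * COS[y][v]
                 for x in range(8) for y in range(8))
             for v in range(8)] for u in range(8)]


def idct2(coeffs):
    """Inverse 8x8 DCT."""
    return [[0.25 * sum(_alpha(u) * _alpha(v) * coeffs[u][v] * COS[x][u] * COS[y][v]
                        for u in range(8) for v in range(8))
             for y in range(8)] for x in range(8)]


def jpeg_block_roundtrip(block, qf):
    """Quantize and dequantize one 8x8 luma block at the given quality factor."""
    table = scaled_table(qf)
    coeffs = dct2([[p - 128 for p in row] for row in block])
    quantized = [[round(coeffs[u][v] / table[u][v]) for v in range(8)]
                 for u in range(8)]
    dequantized = [[quantized[u][v] * table[u][v] for v in range(8)]
                   for u in range(8)]
    restored = idct2(dequantized)
    return [[max(0, min(255, round(p + 128))) for p in row] for row in restored]
```

Because each block is processed without any knowledge of its neighbours, the rounding error differs from block to block, which is the origin of the blocketization artifacts mentioned above.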
With the purpose of reducing these artifacts, many JPEG artifact reduction algorithms have been proposed in recent years. These methods include both traditional image processing pipelines [3]- [8] and machine learning approaches [9]- [16], both of which have made great progress in the restoration of corrupted images. However, these methods suffer from two main limitations: the first one is that they need to train a different model for each possible quality factor (QF), so they cannot be applied to generic images downloaded from the web unless the QF used for compression is known; the second one is that the great majority of methods in the state of the art restore just the luma channel or do not fully exploit the knowledge about the JPEG compression pipeline.
To address these problems we propose a new method for the restoration of JPEG compressed images in YCbCr color space, based on machine learning and specifically on convolutional autoencoders. The proposed approach consists of two deep autoencoders, used respectively for luma and chroma restoration, that are able to restore images independently of the quality factor used for the compression. The main contributions are the following:
- the design of a method for the restoration of JPEG compression artifacts that is independent of the QF used;
- the design of a model trainable end-to-end that fully exploits knowledge about the JPEG compression pipeline;
- a thorough comparison with the state of the art on three standard datasets at fixed QFs;
- an analysis of the robustness of restoration results at QFs not used for training.

II. RELATED WORKS
The task of JPEG compression artifact removal has been addressed in different ways in the past years. The existing methods can be broadly classified into two groups: traditional image processing methods and learning based methods.
The first group includes methods based on traditional image processing techniques working both in the spatial and in the frequency domain. For spatial domain processing, different kinds of filters have been proposed, with the intent of restoring specific areas of the images such as edges [3], textures [4], smooth regions [5], etc. Algorithms usually rely on information obtained by the application of the Discrete Cosine Transform (DCT) [6]. SA-DCT, proposed by Foi et al. [7], attempts to reconstruct an estimate of the signal using the DCT of the original image together with the spatial information contained in the image itself. However, SA-DCT is not capable of reproducing details like sharp edges or complex textures. To overcome this limitation, different restoration oriented methods have been proposed, like the Regression Tree Fields based method (RTF) [8]. The RTF uses the results of SA-DCT to restore images, taking advantage of a regression tree field model.
Following the success of Deep Convolutional Neural Networks (Deep-CNNs) in image processing tasks, such as image denoising [10] and Single-Image Super-Resolution [17], Deep-CNNs have been applied with success to the JPEG compression artifact removal task. The basic idea behind Deep-CNNs is to learn a function that maps a set of images from an input distribution to the desired output one. In the artifact removal case the objective is to map degraded images into a distribution without noise.
The neural network obtained at the end of the training process represents an approximation of the desired function for the translation of the images from one distribution to another.
The first attempt with this kind of model was made by Dong et al. [9], who proposed ARCNN, a model inspired by SRCNN [17], a neural network for Super-Resolution. This first attempt was followed by DnCNN [10], a CNN for general denoising tasks that has also been used on JPEG compressed images, and CAS-CNN [11], a model proposed by Cavigelli et al., who presented a much deeper model capable of obtaining higher quality images. Wang et al. proposed D3 [12], a deep neural network that adopts JPEG-related priors to improve reconstruction quality, which obtained an improvement in speed and performance with respect to the previous models. In 2017, Galteri et al. [13] developed a generative adversarial network (GAN) [18] for artifact removal and texture reconstruction.
In 2018 several new models for JPEG artifact removal were presented, showing interesting improvements in the quality of the results. Liu et al. [14] proposed a Multi-level Wavelet CNN (MWCNN), a model based on the U-Net architecture [19], trained and used for multiple tasks: compression artifact removal, denoising and super-resolution. Zhang et al. [15] developed DMCNN, a Dual-Domain Multi-Scale CNN, which achieves higher quality results than the previous works by using both pixel and frequency (i.e. DCT) domain information. Lastly, for S-Net, the most recent method, Zheng et al. [16] proposed a "greedy loss architecture" to train deeper models capable of outperforming the previous state of the art.

III. PROPOSED METHOD
The methods in the state of the art mainly suffer from two limitations: the first one is that each machine learning model needs to know the JPEG compression Quality Factor (QF) of each input image to properly restore a compressed image; the second one is that the great majority of them are capable of restoring only the luma channel without considering the chroma components, and the only one that recovers all three channels [16] does not fully exploit theoretical knowledge of the JPEG compression pipeline.
In this work we propose a method able to overcome both these problems. The first problem has to do with the way the models are trained: all of the previous methods make the implicit assumption that the compression quality factor QF used to compress the input images is known. In fact, most of the previous works present networks trained on datasets compressed at specific quality factors (the most common being QF = 10, 20, 30 and 40). This way of training the models leads to two limitations:
- the models are capable of correctly restoring only images at a specific QF, with the consequence that a specific training for each quality factor is needed;
- the QF used for the compression of the images is needed in order to train a model and correctly restore the images: this information is usually not available for images coming from unknown sources (e.g. downloaded from the web), thus largely limiting the usability of the model.
In order to overcome the necessity to know the compression quality factor, we train our model on a dataset containing images compressed at different QFs: this will make the model more generic and able to restore images taken in the wild, i.e. without knowing the actual QF used. This objective poses a challenge, since the training of such a quality independent model is much harder than training on a single quality factor.
The second problem concerns the way the previous models restore the images: the vast majority of the previous state-of-the-art methods are trained on the luma channel (Y channel of the YCbCr space) of the images. This approach is based on the fact that the JPEG compression algorithm applies the DCT to the Y channel, introducing ringing and blocketization artifacts on the luma channel, while the Cb and Cr channels are just sub-sampled using bicubic interpolation. The design and training of a model for the specific restoration of the luma component and its subsequent application for the restoration of the chroma components (as done for example by ARCNN [9]) introduces chromatic aberrations and artifacts in the final result. S-Net [16] is the only method considering this problem: instead of training a model for the restoration of just the luma component, it takes as input a full RGB image and recovers a full RGB image as output.
To overcome this second limit and obtain better results we exploit the knowledge of how the JPEG compression pipeline works and propose the use of two models for the image restoration in YCbCr space: the first model restores the Y channel; the second model then uses the result as a Structure Map (i.e. a guide) for the restoration of the chroma components. A schematic representation of the proposed method is depicted in Figure 2.

A. Luma and chroma Restoration Model
The vast majority of learning based methods for JPEG compression artifact removal in the state of the art [9]- [12], [14], [15] focus exclusively on the luma component of the images. Generally these methods perform the compression artifact removal working on the Y channel of the images, after converting them to YCbCr color space. In some cases the learned model is then applied as is to the Cb and Cr channels as well (e.g. [9]). These approaches do not take into consideration the chroma aspects of the images, generating results with aberrations in RGB space and low perceptual quality.
Moreover the JPEG compression algorithm, when operating with very low compression quality factors, such as QF < 20, tends to change the colors of the input images in two different ways: hue change and spatial location change. As can be seen in Figure 3, in the compressed version of the Cb and Cr channels, as expected the color resolution is reduced, and also, for some elements, the color position does not correspond to the one in the original uncompressed image.
Keeping the above considerations in mind we propose a method for restoring both luma and chroma components of the compressed images (see Figure 2). The method consists of two steps: the first step, after the conversion of the input image into YCbCr color space, involves the restoration of the Y channel alone, using a first model named LumiNet, and produces Y' as output. The second step concatenates Y'CbCr along the channel dimension and uses a second model, named ChromaNet, to restore the CbCr channels. This second step uses Y' as a map of the structures present in the image (i.e. a sort of guide) to condition the second network to recover the color hue and contours, and produces Cb'Cr' as output. The final output is obtained by concatenating Y'Cb'Cr' and converting them back to RGB. LumiNet and ChromaNet are two different deep CNN autoencoders, both exploiting a revisited version of the Residual Blocks [20].
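The data flow of the two steps can be sketched as follows. This is only an illustration of the pipeline structure, not the actual implementation: the two networks are passed in as plain callables (`lumi_net` and `chroma_net` are placeholders), images are nested lists of pixels, and the color conversion uses the full-range JFIF constants, which is an assumption (the conversion routine used for the experiments may use different constants).

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range JFIF RGB -> YCbCr conversion (assumed constants)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion, rounded and clipped to the 8-bit range."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return tuple(max(0, min(255, round(c))) for c in (r, g, b))


def restore_image(rgb_image, lumi_net, chroma_net):
    """Two-step restoration: Y first, then CbCr guided by the restored Y'."""
    ycc = [[rgb_to_ycbcr(*px) for px in row] for row in rgb_image]
    y = [[px[0] for px in row] for row in ycc]
    cb = [[px[1] for px in row] for row in ycc]
    cr = [[px[2] for px in row] for row in ycc]

    # Step 1: restore the luma channel alone.
    y_restored = lumi_net(y)

    # Step 2: restore the chroma channels, using Y' as a structure guide.
    cb_restored, cr_restored = chroma_net(y_restored, cb, cr)

    # Recombine Y'Cb'Cr' and convert back to RGB.
    return [[ycbcr_to_rgb(y_restored[i][j], cb_restored[i][j], cr_restored[i][j])
             for j in range(len(row))]
            for i, row in enumerate(rgb_image)]
```

With identity functions in place of the two networks, for example `restore_image(img, lambda y: y, lambda y, cb, cr: (cb, cr))`, the sketch reduces to a color space round trip, which is a convenient sanity check of the plumbing.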

B. Deep Residual Autoencoder Architecture
Autoencoder architectures have been widely used in image processing tasks like image-to-image translation [21], Super-Resolution [22], image inpainting [23] and rain removal [24]. Autoencoders generally present a structure made of three parts: the encoder, which extracts features from the n-dimensional input (usually 1 or 3 channels); a central part, which performs feature processing; and the final decoder, which decodes the processed features into the output image having the desired dimensions. Figure 4 shows a schematic representation of the proposed model, while a more detailed description of its architecture is reported in Table I. The encoder, which consists of two convolutions followed by Leaky ReLU activations, is followed by a central part for feature enhancement consisting of a sequence of Residual-in-Residual Dense Blocks (RRDB) [25], a modified version of the well known residual blocks originally introduced in the ResNet architecture [20], which have been shown to perform well in other image processing tasks, e.g. image super-resolution [25], [26]. The RRDBs combine multi-level residual learning and a dense connection architecture: they are designed without Batch Normalization and apply residual learning at different levels. The RRDBs are shown in Figure 5: each RRDB is made of five Dense Blocks, which use only convolutions with Leaky ReLU activations and dense skip connection structures, combined together with other skip connections. Finally, the decoder is designed symmetrically with respect to the encoder.
The same architecture has been used for both the luma and chroma restoration networks, with two differences: i) a different depth in terms of number of RRDBs used in the central part; ii) a different feature extraction from the input in the encoder part. For the restoration of the luma (Y channel) the number of central RRDBs is set to five, while for the CbCr restoration the number of RRDBs is decreased to three. The second and more important difference is in the first layer of the CbCr version of the network, which is a 3-dimensional convolutional layer.
Considering that the input of the CbCr-Net is the concatenation (along the channel dimension) of the restored Y' channel with the Cb and Cr channels, we decided to use a 3D convolution to make the model capable of correlating information about color and structures by using the same kernels for all the information coming from the three input channels.

[Figure: schematic of the two restoration steps (IN, OUT). Step 1: Y channel restoration. Step 2: Cb Cr channels restoration.]
[...] Cr channels, which are then concatenated with the restored Y' channel in order to obtain the complete restored image. In order to improve the quality of the generated results, as well as to make the training process more stable, the proposed architecture includes the following design choices:
- removal of Batch Normalization (BN) layers from the Residual Blocks;
- use of a residual scaling parameter in each Residual Block;
- initialization of the model weights using a scaled version of the Kaiming initialization [27].
The removal of the batch normalization layers has been shown, in image Super-Resolution [26] and image deblurring [28] tasks, to increase the quality of the generated images in terms of quality indexes (PSNR and SSIM [29]). On the other hand, since BN layers improve the stability of the training, their removal makes the training of deep networks more difficult. To solve this issue, two solutions have been shown to work well: the so called residual scaling (set to 0.2 in our model), which scales each residual so that the input is not wrongly magnified, and a small weight initialization, obtained by applying the Kaiming initialization presented by He et al. [27] scaled by a factor of 0.1. As can be seen in Figure 5, the residual scaling is applied at the higher level of the residual learning architecture, i.e. on the output of each dense block and at the end of the RRDBs.
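The two stabilization choices described above can be written compactly as follows. This is a notational sketch only: the symbols F, β, σ and n_in are introduced here for illustration, and the exact Kaiming variant (normal or uniform, fan-in or fan-out, Leaky ReLU gain) is not stated in the text, so the zero-mean normal, fan-in form below is an assumption.

```latex
% Residual scaling: the output F of a dense block (or of a whole RRDB)
% is damped by a constant factor before being added to its input.
x_{\mathrm{out}} = x_{\mathrm{in}} + \beta \, F(x_{\mathrm{in}}), \qquad \beta = 0.2

% Scaled Kaiming initialization (assumed normal, fan-in variant):
% n_in is the number of input connections of a convolution kernel.
W \sim \mathcal{N}\left(0, \sigma^{2}\right), \qquad
\sigma = 0.1 \cdot \sqrt{\frac{2}{n_{\mathrm{in}}}}
```

Both choices keep the contribution of each residual branch small at the start of training, which compensates for the absence of Batch Normalization.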
IV. EXPERIMENTAL SETUP
The training of the proposed method leads to two different Deep-CNNs, respectively for the restoration of the luminance and chroma components of JPEG compressed images at generic quality (i.e. QFs). In order to evaluate the results, our models have been compared with the state of the art in four different experimental setups:
1) known QF luminance restoration: comparison with the state-of-the-art methods which work only on the Y channel of the input images;
2) unknown QF luminance restoration: comparison to test the ability of the models to restore images at intermediate QFs never seen during training;
3) high and low details density areas restoration: evaluation of the performance of the state-of-the-art methods and the proposed one over specific areas of the images, by dividing the images into patches classified on high-to-low frequency (DCT domain) and high-to-low detail density;
4) color restoration: evaluation of the color restoration capability of the model on the images converted to RGB space after processing.

A. Dataset
The dataset used for training is the DIV2K dataset, a collection of high-quality images (2K resolution) presented during the NTIRE2017 challenge [30] for image restoration tasks. This dataset is made of a total of 900 images: 800 are used for training while the remaining 100 are used for validation. The complete dataset also contains 100 images for testing. The ground truths of this last part have not been released after the challenge, and therefore are not used in this paper.
With the purpose of increasing the variety of textures and patterns shown to the model during training, we have combined the DIV2K dataset with the FLICKR2K dataset [31], a collection of 2650 high-quality images (same resolution as DIV2K) collected from the Flickr website.
In order to train the models on different quality factors, for each image in the dataset we have applied 10 different compression levels, corresponding to the quality factors from QF = 10 to QF = 100 with step 10. The images have been compressed in RGB space with the MATLAB standard library function; the compressed images have then been converted to YCbCr space during the training phase using the PYTHON SCIKIT-IMAGE library (v0.14.0). The compressed version of the DIV2K training set contains 8000 images. The same operation has been applied to the FLICKR2K dataset, for a total of about 34k training images.
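The size of the training set follows directly from the numbers given above. The short check below only reproduces that arithmetic; the constant and function names are hypothetical.

```python
DIV2K_TRAIN_IMAGES = 800
FLICKR2K_IMAGES = 2650

# Quality factors used for training: 10, 20, ..., 100.
QUALITY_FACTORS = list(range(10, 101, 10))


def compressed_count(n_images):
    """Number of compressed images obtained from n_images originals."""
    return n_images * len(QUALITY_FACTORS)


div2k_compressed = compressed_count(DIV2K_TRAIN_IMAGES)    # 8000
flickr2k_compressed = compressed_count(FLICKR2K_IMAGES)    # 26500
total_training_images = div2k_compressed + flickr2k_compressed  # 34500, i.e. about 34k
```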
The evaluation of our model has been done on the LIVE1 [29], CLASSIC-5 and BSD500 [32], three benchmark datasets widely used for JPEG artifact removal algorithm evaluation. For the evaluation of the behaviour of the models with the unknown compression quality factor we adopted the SDIVL [33], a dataset proposed for Image Quality Assessment task.

B. Evaluation metrics
The metrics commonly adopted for the evaluation of the quality of images in artifact removal tasks are the PSNR, PSNR-B [34] (which focuses the evaluation on the blocketization in the image) and SSIM [29] indexes. The PSNR and PSNR-B indexes give information about the quality of the images in terms of noise and perceived quality, with PSNR-B also taking into consideration the blocketization artifacts; the SSIM index is an indicator of the quality of edges and structures contained in the image. For all three indexes a higher value means that the content and the structures in the reconstructed image are more similar to the ones in the target image.
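As a reference for the simplest of the three indexes, the following is a minimal sketch of the standard PSNR definition for 8-bit images, PSNR = 10 log10(MAX² / MSE). It is not the MATLAB routine used for the reported results, and the function name and the nested-list image representation are chosen here only for illustration. PSNR-B and SSIM require additional terms and are not covered.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two single-channel images."""
    ref = [p for row in reference for p in row]
    res = [p for row in restored for p in row]
    if len(ref) != len(res) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, res)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For example, two images that differ by exactly one gray level at every pixel have MSE = 1 and therefore a PSNR of 20 log10(255), roughly 48.13 dB.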

C. Training Details
The whole training phase has been carried out on an NVIDIA GTX 1070 GPU with 8 GB of memory using the PYTORCH framework, version 0.4.1. The mini-batch size has been set to 8 and each input image has been cropped to a patch size of 100 × 100 pixels. During the experiments we tried to train the network with different crop sizes (32 × 32, 50 × 50, 100 × 100 and 400 × 400), observing that training deeper networks with a bigger patch size gives a boost in performance on both the PSNR and SSIM indexes.
We also explored the use of different numbers of RRDBs in the model: we observed that with deeper models, using this specific kind of residual block, the results improved, with higher PSNR and SSIM values on the validation set. The final structure uses five RRDBs for the Y channel restoration model and three RRDBs for the CbCr model, where each convolution has 64 filters. We found this configuration to be the best one with respect to the patch size, the number of RRDBs, the number of filters and the limits imposed by the memory available on our board. We trained the model using the Adam optimizer [35] with β1 = 0.9, β2 = 0.999, and with the learning rate initialized at 2 × 10^−4 and decreased by a factor of 2 after 200 epochs of training. The training has been performed using the L1 loss, since it allows us to achieve better PSNR results and makes the training more stable.
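The loss and the learning rate schedule described above can be sketched in a few lines. This is a framework-independent illustration, not the training code: `l1_loss` and `learning_rate` are hypothetical names, and the schedule is read here as a single halving after epoch 200, which is an assumption (the text does not say whether the decrease is repeated).

```python
def l1_loss(prediction, target):
    """Mean absolute error between two flat sequences of pixel values."""
    if len(prediction) != len(target) or not prediction:
        raise ValueError("inputs must be non-empty and have the same length")
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(prediction)


def learning_rate(epoch, base_lr=2e-4, decay_epoch=200, factor=2.0):
    """Learning rate at a given epoch (assumed: one decay step after decay_epoch)."""
    return base_lr / factor if epoch >= decay_epoch else base_lr
```

Compared with a squared error, the L1 loss penalizes large and small errors proportionally, which is one common explanation for its more stable behaviour in image restoration training.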
Since the state-of-the-art methods operate only on the Y channel of the images, in order to make a fair comparison the metrics are evaluated by comparing the Y channel recovered by the first network with the corresponding target images, using the MATLAB standard libraries, over five different compression qualities: 10, 20, 40, 60, 80. For each method, on all the datasets considered, we report the results taken from the corresponding publication, except for ARCNN and MWCNN, for which the released source code is used for the evaluation. Since the training of the proposed method leads to a single model that can be used for all the quality factors, we used the same model for the evaluation at all the qualities previously mentioned. All the state-of-the-art methods compared, instead, have a different trained model for each QF considered. Tables II, III and IV respectively report the comparison on the LIVE1, BSD500 and CLASSIC-5 datasets for all three metrics considered. As can be seen, our model outperforms the state of the art on all the metrics. With the proposed model we obtained improvements with respect to the state-of-the-art methods on both general perceptual quality (PSNR/PSNR-B) and structure reconstruction (SSIM). Since each index focuses on different aspects of the restoration quality, no index alone is capable of summarizing all the aspects of a good reconstruction. Therefore, we also compare the methods in a graph-style view, reported in Figures 1 and 6, to correlate the two indexes. In order for a method to obtain a more pleasing perceived quality, both metrics must reach high values. From this kind of view it is easy to see that the proposed method outperforms the current state-of-the-art models even though a single model is used for all the QFs.

B. Restoration with unknown compression Quality Factor
A further evaluation concerns the capability of the models to recover images at compression quality factors never seen during training. In most real use cases, the JPEG compression quality factor previously applied to an image is not known: it is therefore important that a model is able to recover the images without this prior information.
On the other hand, if we are able at least to estimate the compression quality factor of the input compressed image, following the previous approaches we should train new models for each specific quality factor needed, or use the model trained for the closest QF to the desired one.
We compare our model with the two state-of-the-art models for which the code is available (i.e. ARCNN and MWCNN) in a specific selection of cases. Since previous models have been trained on specific quality factors, and our model has instead been trained over quality factors from 10 to 100 in steps of 10, without the use of images with QFs in between, we decided to test the model robustness on never seen artifacts. In order to perform the evaluation in a coherent way, for the state-of-the-art algorithms we used the pretrained models for the nearest quality factor: for example, if the input image has been compressed with QF = 17 we used the models trained for QF = 20. For this evaluation we adopted the SDIVL dataset: to each image of the test set we applied all of the compression factors in the interval 5 − 25. The evaluation is done in the same way as in the previous section, by extracting the Y channel and measuring the PSNR, PSNR-B and SSIM indexes.
Figure 7 shows the results of the models on SDIVL over all the compression quality factors. As can be seen in those graphs, our model shows a more stable behaviour: it is capable of restoring images at different QFs with a more coherent and smooth behaviour in relation to the increase of the QF, in comparison with the other methods. Moreover, the previous state-of-the-art models have difficulties in restoring images at quality factors distant from the one used for training. It is particularly interesting to see that the other models have difficulties in restoring images at qualities higher than the QF used in training, in terms of structures in the images (Figure 7c), due to the more complex textures never seen by the models during the training phase.
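The selection of the pretrained baseline model for an arbitrary QF can be sketched as below. The set of available QFs (10, 20, 30, 40) is taken from the most common training qualities mentioned in Section III and is an assumption about the released models; the function name is hypothetical, and so is the tie-breaking rule (the higher QF is preferred when two trained QFs are equally distant), which is not specified in the text.

```python
def nearest_trained_qf(qf, trained_qfs=(10, 20, 30, 40)):
    """Return the trained QF closest to the input QF.

    Ties are resolved in favour of the higher QF (assumed rule).
    """
    return min(trained_qfs, key=lambda t: (abs(t - qf), -t))
```

For the example in the text, `nearest_trained_qf(17)` selects the model trained for QF = 20, while any QF in the lower part of the tested interval, such as QF = 5, falls back to the model trained for QF = 10.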

C. High and low frequency areas restoration
In order to better understand whether the proposed method performs better than the approaches in the state of the art only on certain image types, we conduct a further experiment: we divide the images from the LIVE1 test set, compressed at QF = 10, into 64 × 64 patches and classify each of them into five categories. The categories are obtained by equally dividing the patches into five bins with respect to both frequency and detail density. Patch frequency is computed as the weighted average of the normalized magnitude of the 2D Fourier Transform. Patch detail density is computed as the 2D average of the result of the Canny edge detection. The results for the considered evaluation metrics over the five categories of frequency and detail density are respectively reported in Tables V and VI. From the results reported it is possible to notice that the proposed method consistently outperforms the state of the art on all the frequency and detail density categories.
[TABLE IV: COMPARISON ON TEST SET CLASSIC-5: FOR THE METHODS IN THE STATE OF THE ART FIVE DIFFERENT MODELS ARE TRAINED, ONE FOR EACH QF CONSIDERED. THE PROPOSED METHOD USES THE SAME MODEL FOR ALL THE QFS. Columns: Quality, ARCNN [9], DnCNN [10], CAS-CNN [11], D3 [12], DMCNN [15], MWCNN [14], S-NET [16], ARGAN-MSE [...]]
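The patch extraction and the division into five categories can be sketched as follows. The sketch does not compute the frequency or Canny-based scores themselves; it assumes that one score per patch is already available. It also assumes that "equally dividing" means bins with the same number of patches (equal-frequency binning) and that incomplete patches at the image border are discarded; both points, like the function names, are assumptions.

```python
def patch_origins(height, width, size=64):
    """Top-left corners of the non-overlapping size x size patches that fit fully."""
    return [(top, left)
            for top in range(0, height - size + 1, size)
            for left in range(0, width - size + 1, size)]


def equal_frequency_bins(scores, n_bins=5):
    """Assign each score to one of n_bins categories with (almost) equal counts.

    Category 0 holds the lowest scores, category n_bins - 1 the highest.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(scores)
    return bins
```

The evaluation metrics can then be averaged separately over the patches of each category, once with the frequency score and once with the detail density score.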

D. Color Restoration
The final evaluation is focused on the color restoration capability of the models. As in the previous evaluations, the comparison involves ARCNN [9], MWCNN [14] and our proposed model.
We restored the images from the LIVE1 test set at the lowest quality factors, QF = 10, 20, 40. For this specific evaluation we restored both luma and chroma components. In the case of the ARCNN and MWCNN methods, we adopted the same model for all three channels (Y, Cb and Cr), while our method uses the two different networks to first restore the luminance channel and then the chrominance channels.
For this comparison we computed the PSNR, PSNR-B and SSIM indexes over the restored images in RGB space, instead of only evaluating the luminance information, using the MATLAB standard library: numerical results are reported in Table VII, and visual results are shown for some patches from the LIVE1 images in Figure 8. As can be seen, the proposed model obtains better results than the other methods in terms of PSNR, PSNR-B and SSIM, and the difference is also evident in the final images. The blocketization and the color aberration coming from the compression are blurred but maintained by the other models, while they are removed by our model, which reshapes the color information with respect to the structures in the images. The results are much more pleasing and realistic than those of the other methods.

VI. CONCLUSION
In this paper we proposed a deep residual autoencoder exploiting Residual-in-Residual Dense Blocks (RRDB) to remove artifacts in JPEG compressed images, that is independent from the QF used. The proposed model operates in the YCbCr color space and performs a two-phase restoration of JPEG artifacts: in the former phase, a first autoencoder exploiting 2D convolutions is used to restore the luma channel; in the latter phase, a second autoencoder, by stacking along the channel dimension the results of the first autoencoder and the original chroma channels, employs 3D convolutions to exploit the restored luma channel as a guide, and restores the chroma channels.
The main contributions of this paper are: i) the design of a method for the restoration of JPEG compression artifacts that is independent of the QF used; ii) the design of a model trainable end-to-end that fully exploits knowledge about the JPEG compression pipeline; iii) a thorough comparison with the state of the art on three standard datasets at fixed QFs; iv) an analysis of the robustness of restoration results at QFs not used for training. Extensive experimental results on three widely used benchmark datasets (i.e. LIVE1, BSD500, and CLASSIC-5) show that our model is able to outperform the state of the art with respect to all the evaluation metrics considered (i.e. PSNR, PSNR-B, and SSIM). This result is remarkable since the approaches in the state of the art use a different set of weights for each compression quality, while the proposed model uses the same weights for all of them, making it applicable to images in the wild where the QF used for compression is unknown. Furthermore, the proposed model shows greater robustness than state-of-the-art methods when applied to compression qualities not seen during training. Since preliminary experiments with the same architecture showed good results for the restoration of other artifacts (i.e. noise removal, in the CVPRW NTIRE2019 challenge), as future work we plan to investigate its extension to other single and multiple distortions [36].