WCDGAN: Weakly Connected Dense Generative Adversarial Network for Artifact Removal of Highly Compressed Images

In highly compressed images, i.e. quality factor $q \leq 10$, JPEG compression causes severe compression artifacts including blocking, banding, ringing and color distortion. These artifacts seriously degrade image quality and hinder subsequent tasks such as object detection and semantic segmentation. In this paper, we propose a weakly connected dense generative adversarial network for artifact removal of highly compressed images, named WCDGAN. WCDGAN has three main ingredients: mixed convolution, weakly connected dense block (WCDB), and mixed attention. In the loss function, we add a perceptual loss to generate photo-realistic images after compression artifact removal. Experimental results show that WCDGAN successfully removes compression artifacts and produces sharp edges, clear textures and vivid colors even in highly compressed images. Moreover, WCDGAN outperforms state-of-the-art methods for compression artifact removal in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).


I. INTRODUCTION
JPEG [1] is a widely used image format that represents rich and vivid content thanks to its sophisticated compression algorithm. It first converts an RGB image into the YCbCr color space and downsamples the chroma components. Then, it performs the discrete cosine transform (DCT) [2] on luma and chroma components in non-overlapping 8 × 8 blocks. Given a quality factor, the corresponding quantization tables (QTs) are obtained, and the DCT coefficients are quantized based on the QT. Finally, the quantized DCT coefficients are converted into a bitstream for transmission according to the encoding rules. However, as can be seen from Eq. (1), the floor function makes the DCT coefficients at the decoding end differ from those at the encoding end.
$$D_{de} = \left\lfloor \frac{D_{en}}{Q} \right\rfloor \cdot Q \tag{1}$$

where $\lfloor \cdot \rfloor$ is the floor operation, $Q$ is the quantization step length, and $D_{en}$ and $D_{de}$ denote the DCT coefficients at the encoding end and the decoding end, respectively.
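To make the distortion caused by quantization concrete, the following minimal sketch simulates the quantize/dequantize round trip on a few hypothetical DCT coefficients. The coefficient values and the step $Q = 16$ are illustrative assumptions, not values from the paper, and the floor-based rule follows the formulation above:

```python
import numpy as np

def quantize_dequantize(d_en, q):
    """Simulate JPEG quantization: the decoder can only recover floor(D_en/Q)*Q."""
    return np.floor(d_en / q) * q

# Hypothetical DCT coefficients (flattened excerpt of one 8x8 block), step Q = 16.
d_en = np.array([103.0, -47.0, 12.0, 3.0])
d_de = quantize_dequantize(d_en, 16.0)
print(d_de)          # [ 96. -48.   0.   0.] -- small coefficients vanish entirely
print(d_en - d_de)   # the quantization error responsible for compression artifacts
```

Note how the two smallest coefficients are mapped to zero: this irreversible loss of (mostly high-frequency) information is exactly the distortion the restoration network must undo.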
Thus, the compression distortion is caused by quantization. As shown in Eqs. (2)-(4), DCT coefficients represent information in the frequency domain and the distortion of DCT coefficients affects the pixel values involved in the calculation in the spatial domain. Moreover, the DCT transformation is carried out block by block and thus the compressed images contain blocking and banding artifacts.
$$F(u,v) = C(u)C(v)\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} f(i,j)\cos\frac{(2i+1)u\pi}{2N}\cos\frac{(2j+1)v\pi}{2N} \tag{2}$$

$$f(i,j) = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} C(u)C(v)F(u,v)\cos\frac{(2i+1)u\pi}{2N}\cos\frac{(2j+1)v\pi}{2N} \tag{3}$$

$$C(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases} \tag{4}$$

where generally N = 8; i and j represent horizontal and vertical coordinates in the spatial domain, respectively; u and v represent frequency; F represents the DCT coefficient; f represents pixel values in the spatial domain. Note that when u = 0 (or v = 0), the scaling factor C(u) (or C(v)) takes the value $\sqrt{1/N}$, as given in Eq. (4).

A. RELATED WORK
For a long time, image compression has adopted DCT-based block transform coding, which has been successfully applied to various coding standards such as JPEG and MPEG. In addition to DCT-based image compression, the two-dimensional discrete wavelet transform (2D-DWT) decomposes an image into low- and high-frequency components, i.e. subbands, which simultaneously capture spatial and frequency information. DWT-based image compression quantizes and encodes these low- and high-frequency components. Bose et al. [3] used vector quantization (VQ) to compress medical images, embedded a watermark into the compressed images, and analyzed the effect of watermarking on the content of medical images. In recent years, JPEG compression artifact removal [4]-[12] and image denoising [13]-[15] have been treated as computer vision tasks. Existing methods are roughly classified into two categories: traditional image processing approaches and recent deep learning schemes. The traditional image processing approaches operate in the spatial and frequency domains. The spatial domain methods mainly focus on restoring flat areas [16], edges [17] and textures [18]. The frequency domain methods mainly utilize DCT coefficients as prior information [4], [19], [20]. Shape-adaptive DCT (SA-DCT), proposed by Foi et al. [21], combined spatial information of the images and DCT coefficients to accomplish compression artifact removal. Nowadays, deep convolutional neural networks (CNNs) have achieved continuous success in computer vision and image processing tasks. Inspired by image super-resolution based on CNN (SRCNN) [22], Dong et al. [5] proposed an end-to-end network for JPEG compression artifact removal, named ARCNN, which adds one convolution layer to realize feature enhancement. It is the first compression artifact removal work based on deep learning. Following ARCNN, Zhang et al.
[6] proposed DnCNN for general image denoising tasks and applied it to compression artifact removal. Chen et al. [23] proposed a trainable nonlinear reaction-diffusion model (TNRD) based on a fixed number of gradient descent inference steps. However, TNRD is limited in capturing features of image structure. Galteri et al. [24] utilized a generative adversarial network (GAN) [25] to remove compression artifacts. Tai et al. [7] proposed a persistent memory network (MemNet) stacked from memory blocks, each consisting of a recursive unit and a gate unit, to learn explicit persistent memories. A deep multi-scale CNN (DMCNN), proposed by Zhang et al. [8], integrated spatial and frequency information to make full use of the DCT coefficient prior, thus producing higher-quality results than previous work. Liu et al. [26] developed a multi-level wavelet CNN (MWCNN) with a U-Net architecture [27] for multiple image restoration tasks: JPEG artifact removal, image denoising and image super-resolution.
Nowadays, generative adversarial networks (GANs) have also developed rapidly and been successfully applied to computer vision tasks such as image super-resolution, image inpainting, and style transfer. A GAN is usually composed of a generator and a discriminator. The aim of the discriminator is to distinguish true images from false ones. The purpose of the generator is to produce images realistic enough that the discriminator cannot tell them apart. The ideal result is a Nash equilibrium in which neither the generator nor the discriminator can further reduce its cost. Despite their great success, GANs still have many defects concerning generalization and training stability. Many solutions have been proposed to address these problems. Arjovsky et al. [28], [29] improved the loss function by minimizing the Wasserstein distance between the model and data distributions. Then, Mao et al. [30] proposed a least squares loss for the discriminator, which stabilized training and improved image quality.

B. MOTIVATION
When the compression rate is high, JPEG compression produces noticeable artifacts that seriously degrade image quality as shown in Fig. 1, where the quality factor q is 5. The most serious distortion is the blocking artifacts, which cause banding effects and color distortion in images (see the sky and statues of the compressed images). Thus, compression artifact removal also involves color restoration. In this paper, we propose a novel and effective weakly connected dense generative adversarial network for compression artifact removal, called WCDGAN. The proposed WCDGAN aims to remove compression artifacts from highly compressed images with q no higher than 10. We build a weakly connected dense generator for WCDGAN that deploys the attention mechanism [31], [32] and dilated convolution [33], [34] to learn informative features for compression artifact removal. As shown in Fig. 2, the generator of WCDGAN has three main ingredients: mixed convolution, weakly connected dense block, and mixed attention. First, we combine dilated convolution and standard convolution, i.e. mixed convolution, to enlarge the receptive field and alleviate the grid effect caused by dilated convolution. Second, we provide a weakly connected dense block (WCDB) to reuse the features of the previous convolutional layers in the network and make features more expressive while reducing network parameters and computational cost [35], [36]. Third, we combine channel attention [37] and spatial attention with mixed convolution in the WCDB to extract informative features. To handle the color distortion caused by high compression, we implement WCDGAN in the RGB domain rather than the YCbCr domain. For the discriminator, we use a simple network structure that contains nine convolution layers and two fully connected layers as shown in Fig. 3. Finally, we add a perceptual loss to the loss function to generate realistic images.
WCDGAN successfully removes compression artifacts from highly compressed images while restoring vivid colors (see Fig. 1).
Compared with existing methods, the main contributions of this paper are as follows:
• We propose WCDGAN to generate photo-realistic images from highly compressed ones while removing compression artifacts. The generator of WCDGAN has three main ingredients: mixed convolution, weakly connected dense block (WCDB), and mixed attention.
• We combine standard convolution and dilated convolution into mixed convolution to capture a larger receptive field without grid effect. Moreover, we combine channel attention and spatial attention with mixed convolution into WCDB to extract more informative features.
• We add a perceptual loss to generate photo-realistic images with compression artifact removal.

II. PROPOSED METHOD
A. NETWORK ARCHITECTURE
As shown in Fig. 2, the generator of WCDGAN mainly consists of three parts: shallow feature extraction, deep feature extraction and reconstruction. Denote X and Y as the input and output, respectively. Initially, two standard convolutions are used for shallow feature extraction, in which only the second convolutional layer is followed by a LeakyReLU activation function. To capture more local information, we use larger convolution kernels of size 5 × 5 in shallow feature extraction as follows:

$$F_0 = f_{SF}(X) \tag{5}$$

where $f_{SF}(\cdot)$ denotes the shallow feature extraction operation, and $F_0$ denotes its output, which is taken as the input to deep feature extraction. The deep feature extraction is a cascade of N WCDB modules (see Fig. 2). As shown in Fig. 4, the WCDB module reuses the features of the previous convolutional layers, thus making features more expressive while reducing network parameters and computational cost. As suggested by ResNet [38] and EDSR [39], we employ a residual connection in this operation to avoid the emergence of gradient vanishing and gradient explosion [40]. Therefore, we get:

$$F_1 = f_{DF}(F_0) + F_0 \tag{6}$$

where $f_{DF}(\cdot)$ denotes the deep feature extraction, and $F_1$ denotes its output, which is taken as the input to the reconstruction part. The reconstruction part has the same convolution structure as the shallow feature extraction part, except that the last convolution layer is not followed by an activation function while the former layer is. In the last two layers, we do not adopt large convolution kernels, but deploy small convolution kernels of 3 × 3. Finally, we obtain the final reconstruction result as follows:

$$Y = f_{RE}(F_1) \tag{7}$$

where $f_{RE}(\cdot)$ denotes the reconstruction operation.

Our discriminator is relatively simple to manifest the learning ability of the generator as shown in Fig. 3. It has nine convolutional layers and two fully connected layers, and adopts a common block composed of Conv, ReLU and Batch Normalization [41].
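The three-part generator described above can be sketched in PyTorch as a structural illustration. This is a minimal sketch, not the paper's exact implementation: the WCDB cascade is replaced by plain convolutional blocks, and the LeakyReLU slope of 0.2 and block count of 4 are assumptions.

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Structural sketch of the WCDGAN generator (WCDBs replaced by plain convs)."""
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        # Shallow feature extraction: two 5x5 convs; LeakyReLU after the second only.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, channels, 5, padding=2),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.LeakyReLU(0.2),
        )
        # Deep feature extraction: stand-in for the cascade of N WCDB modules.
        self.deep = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.LeakyReLU(0.2))
            for _ in range(n_blocks)
        ])
        # Reconstruction: two 3x3 convs; no activation after the last layer.
        self.recon = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x):
        f0 = self.shallow(x)      # shallow feature extraction
        f1 = self.deep(f0) + f0   # deep feature extraction with residual connection
        return self.recon(f1)     # reconstruction back to an RGB image
```

All convolutions preserve spatial size, so the output image has the same shape as the compressed input, which is what a restoration network requires.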
s2 means that the stride of the convolution is two and that the number of convolution kernels of the current convolutional layer is doubled. Kernel sizes of all convolutional layers are set to 3 × 3.
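A hedged sketch of such a discriminator follows: nine 3 × 3 convolutional layers where each stride-2 layer doubles the channels, then two fully connected layers. The exact channel schedule, the fully connected width (1024), and the global pooling before the classifier are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    """Common Conv -> BatchNorm -> ReLU block used by the discriminator."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DiscriminatorSketch(nn.Module):
    """Sketch: nine 3x3 conv layers (s2 layers double the channels) + two FC layers."""
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True)]  # 1st conv, no BN
        cin = 64
        # Assumed schedule: channels double exactly at each stride-2 layer.
        for cout, stride in [(128, 2), (128, 1), (256, 2), (256, 1),
                             (512, 2), (512, 1), (1024, 2), (1024, 1)]:
            layers.append(conv_block(cin, cout, stride))
            cin = cout
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1),  # single real/fake score per image
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```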

B. MIXED CONVOLUTION
In low-level computer vision tasks such as image super-resolution, image denoising, and semantic segmentation, the receptive field of the network is an extremely important property [42]. In general, a large receptive field benefits network performance. There are several ways to enlarge the receptive field: (1) make the network deeper; (2) use large convolution kernels; (3) adopt an auto-encoder structure; (4) utilize dilated convolution. The first and second methods inevitably introduce much computational cost, and the first can also cause gradient vanishing or gradient explosion; thus, they are not suitable. The third method deploys a pooling operation in the auto-encoder to reduce the feature size and thereby expand the receptive field. However, pooling inevitably causes irreversible loss of information, and excessive pooling operations are unfriendly to pixel-level tasks [43]. Dilated convolution expands the receptive field without increasing parameters and is therefore a popular choice, but it has its own shortcoming of causing grid effects. To solve this problem, inspired by [44], we design a new convolution module as follows.
This convolution module consists of standard convolution and dilated convolution. As shown in Fig. 5, the input feature is fed to both the standard convolution and the dilated convolution, and their outputs are concatenated along the channel dimension with equal weight. Finally, we adopt a 1 × 1 convolution to compress the feature along the channel dimension and maintain the same shape as the input feature as follows:

$$F_{out} = f_{1\times1}\left(\left[f_S(F_{in}), f_{D=d}(F_{in})\right]\right) \tag{8}$$

where $F_{in}$ and $F_{out}$ represent the input and output of the mixed convolution module, respectively; $[\cdot,\cdot]$ denotes concatenation along the channel dimension; $f_S(\cdot)$ denotes the standard convolution, while $f_{D=d}(\cdot)$ denotes the dilated convolution with dilation $d$; and $f_{1\times1}(\cdot)$ is the 1 × 1 convolution.

FIGURE 5. Mixed convolution module. Note that the outputs of both standard conv1 and dilated conv are concatenated, which is taken as the input of standard conv2 with kernel size 1 × 1.
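The mixed convolution described above can be sketched in PyTorch as follows. This is a minimal sketch under the assumption that both branches keep the full channel width before the 1 × 1 fusion; kernel sizes (3 × 3) and the default dilation of 2 are illustrative.

```python
import torch
import torch.nn as nn

class MixedConv(nn.Module):
    """Mixed convolution: parallel standard and dilated 3x3 convs whose outputs
    are concatenated along channels and fused by a 1x1 conv back to the input width."""
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        self.std = nn.Conv2d(channels, channels, 3, padding=1)  # standard branch f_S
        # Dilated branch f_{D=d}; padding=dilation keeps the spatial size unchanged.
        self.dil = nn.Conv2d(channels, channels, 3,
                             padding=dilation, dilation=dilation)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)        # 1x1 channel compression

    def forward(self, x):
        return self.fuse(torch.cat([self.std(x), self.dil(x)], dim=1))
```

Because the standard branch samples densely while the dilated branch samples sparsely, their fusion enlarges the receptive field without the checkerboard-like grid effect of stacked dilated convolutions alone.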

C. WEAKLY CONNECTED DENSE BLOCK
It has been proved that ResNet solves the gradient vanishing and gradient explosion of deep neural networks to a great extent, which gives networks more powerful representation capabilities. Later experiments that randomly dropped some layers of a trained ResNet and retrained it found that the generalization performance was significantly improved. This shows that a neural network is not necessarily a strictly hierarchical structure, i.e. a layer in the network can rely not only on the features of the adjacent previous layer, but also on features of earlier layers. Based on these observations, Huang et al. [45] proposed the dense connection to alleviate gradient vanishing, strengthen feature propagation, and encourage feature reuse, achieving good performance in many low-level and high-level computer vision tasks. Nevertheless, for networks of the same depth, deploying too many dense connections imposes a heavy burden on training, especially in terms of memory consumption, while the performance is not significantly improved. As a result, we remove a part of the dense connections and all Batch Normalization layers [41] to achieve satisfactory results while reducing the computational cost. Another difference from Huang et al.'s design is that we substitute the proposed mixed convolution for the original standard convolution to obtain a larger receptive field as shown in Fig. 4. The module is described as follows:

$$X_{D=1} = f_{D=1}(X) \tag{9}$$

$$X_{D=2} = f_{D=2}(X_{D=1}) \tag{10}$$

$$X_{D=3} = f_{D=3}(X_{D=2}) \tag{11}$$

$$X_{D=4} = f_{D=4}(X_{D=3}) \tag{12}$$

Eqs. (9)-(12) represent the calculation process from the first mixed convolution to the last mixed convolution, where X denotes the input feature, $f_{D=d}(\cdot)$ means the mixed convolution with dilation d, and $X_{D=d}$ is the output of the corresponding mixed convolution.
Then, we utilize a 1 × 1 convolution to compress the concatenation of the input feature and the outputs of every mixed convolution, and add a residual connection to make the training process stable:

$$X_{out} = f_{1\times1}\left(\left[X, X_{D=1}, X_{D=2}, X_{D=3}, X_{D=4}\right]\right) + X \tag{13}$$

Finally, the proposed mixed attention module, denoted as $f_{MA}(\cdot)$, refines the result:

$$\hat{X} = f_{MA}(X_{out}) \tag{14}$$

where $\hat{X}$ is the final output of the WCDB.
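Putting the pieces together, a WCDB can be sketched as below. The dilation schedule (1, 2, 3, 4) and the per-branch channel widths are assumptions for illustration; the mixed attention step is omitted here so the block stays self-contained.

```python
import torch
import torch.nn as nn

def mixed_conv(ch, d):
    """Parallel standard + dilated 3x3 convs fused by a 1x1 conv (see Fig. 5)."""
    class MC(nn.Module):
        def __init__(self):
            super().__init__()
            self.s = nn.Conv2d(ch, ch, 3, padding=1)
            self.d = nn.Conv2d(ch, ch, 3, padding=d, dilation=d)
            self.f = nn.Conv2d(2 * ch, ch, 1)
        def forward(self, x):
            return self.f(torch.cat([self.s(x), self.d(x)], dim=1))
    return MC()

class WCDB(nn.Module):
    """Weakly connected dense block: a chain of mixed convolutions whose outputs
    (plus the block input) are concatenated once at the end, compressed by a
    1x1 conv, and added back through a residual connection."""
    def __init__(self, channels=64, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList([mixed_conv(channels, d) for d in dilations])
        self.compress = nn.Conv2d((len(dilations) + 1) * channels, channels, 1)

    def forward(self, x):
        feats = [x]
        h = x
        for conv in self.convs:
            h = conv(h)
            feats.append(h)
        # Weak dense connection: a single concat at the end instead of the
        # per-layer concats of a full dense block, saving memory and parameters.
        return self.compress(torch.cat(feats, dim=1)) + x
```

The single end-of-block concatenation is the "weak" part of the connectivity: features are still reused, but the quadratic growth of intermediate concatenations in a full dense block is avoided.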

D. MIXED ATTENTION MODULE
Previous methods for compression artifact removal process channel information equally. However, the learned features have different meanings along the channel dimension because the features in different channels are learned by different and unrelated convolution kernels. Therefore, to learn more informative features, we exploit channel attention to make the learned features more representational [46], [47]. We first convert the channel-wise global spatial information into channel weights by using global average pooling [48]. As shown in Fig. 6, $X = [x_1, x_2, \ldots, x_c, \ldots, x_C]$ denotes the input feature with C channels and $x_c$ means the c-th channel feature. Let $z_c$ denote the pooling result of the c-th channel feature; similarly, $Z = [z_1, \ldots, z_c, \ldots, z_C]$. We formulate it as follows:

$$z_c = f_{GAP}(x_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i,j) \tag{15}$$

where $x_c(i,j)$ is the value at position $(i,j)$ of the c-th channel feature $x_c$, $f_{GAP}(\cdot)$ is the global average pooling operation, and the spatial size of the feature is H × W.
To further exploit channel-wise dependency from the feature information aggregated by global average pooling, we modify the original structure in [49]. We employ a 1 × 1 convolution followed by LeakyReLU, then another 1 × 1 convolution followed by LeakyReLU, and finally a 1 × 1 convolution followed by the Sigmoid function. Since we also utilize the values less than zero after activation, we use LeakyReLU instead of ReLU, which makes the weights more robust. Compared with the original structure, we add a middle layer, and the LeakyReLU activation function can attend to the weaker elements to a certain extent. Since this channel attention structure fully captures the nonlinear interaction between channels and takes the learned relevance as the weight, the refined output features have a stronger capability of information expression:

$$\hat{X} = \sigma\left(f_{1\times1}\left(\delta_{LRe}\left(f_{1\times1}\left(\delta_{LRe}\left(f^{C/r}_{1\times1}(Z)\right)\right)\right)\right)\right) \cdot X \tag{16}$$

where $\delta_{LRe}(\cdot)$ denotes LeakyReLU, $\sigma(\cdot)$ means the Sigmoid function, and $f^{C/r}_{1\times1}(\cdot)$ is a 1 × 1 convolution which acts as channel downscaling with a ratio r.
Different from the channel attention that emphasizes ''what'' is meaningful [50], the spatial attention focuses on ''where'' the informative parts are, which is complementary to the channel attention. First, we substitute a 1 × 1 convolution for the original average-pooling and max-pooling operations to obtain self-adaptive statistical characteristics. Then, we use a 5 × 5 convolution to produce a spatial attention map and refine the features as follows:

$$\hat{X} = \sigma\left(f_{5\times5}\left(f_{1\times1}(X)\right)\right) \cdot X \tag{17}$$

where $f_{5\times5}(\cdot)$ represents a convolution with a kernel size of 5 × 5.
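The two attention branches described above can be sketched in PyTorch as follows. The reduced width `channels // r` for the added middle layer and the LeakyReLU slope of 0.2 are assumptions; the rest follows the layer order given in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """GAP -> 1x1 conv (downscale by r) -> LeakyReLU -> 1x1 conv (added middle
    layer) -> LeakyReLU -> 1x1 conv -> Sigmoid, then rescale the input."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global average pooling: z_c per channel
            nn.Conv2d(channels, channels // r, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels // r, channels // r, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.net(x)  # broadcast channel weights over H x W

class SpatialAttention(nn.Module):
    """A 1x1 conv replaces the avg/max pooling statistics; a 5x5 conv then
    produces the spatial attention map."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 1, 1),          # self-adaptive spatial statistic
            nn.Conv2d(1, 1, 5, padding=2),      # 5x5 conv -> attention map
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.net(x)  # broadcast the H x W map over channels
```

Applying the two in sequence yields the mixed attention: channel attention reweights "what" to keep, spatial attention reweights "where" to look.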

E. LOSS FUNCTION
We use the $L_1$ distance to calculate the content loss $L_c$ between the output image $I_{out}$ and the ground truth $I_{gt}$:

$$L_c = \left\|I_{out} - I_{gt}\right\|_1 \tag{18}$$

The perceptual loss $L_p$ proposed by Johnson et al. [51] measures perceptual similarity. It is defined as the distance between two features of a pre-trained deep neural network. We first feed $I_{out}$ and $I_{gt}$ into VGG-19 to obtain the outputs from a certain convolutional layer, then calculate the $L_1$ distance between the two features as the perceptual loss:

$$L_p = \left\|\phi(I_{out}) - \phi(I_{gt})\right\|_1 \tag{19}$$

where $\phi(\cdot)$ stands for the last convolution layer of VGG-19. $L_p$ is more informative and implements stronger supervision, thus leading to better performance. We use the least squares loss as the adversarial loss in both the discriminator and the generator. LSGAN [30] not only makes the training process more stable but also generates more gradients than the standard GAN, thus improving WCDGAN performance. The discriminator distinguishes true images from fake ones, and its adversarial loss $L^D_{adv}$ is obtained as follows:

$$L^D_{adv} = \mathbb{E}_{I_{gt}\sim P}\left[\left(D(I_{gt})-1\right)^2\right] + \mathbb{E}_{I_{out}\sim Q}\left[D(I_{out})^2\right] \tag{20}$$

where P and Q are the distributions of $I_{gt}$ and $I_{out}$, respectively. The generator aims to make $I_{out}$ close to $I_{gt}$, thus the adversarial loss for the generator $L^G_{adv}$ is obtained as follows:

$$L^G_{adv} = \mathbb{E}_{I_{out}\sim Q}\left[\left(D(I_{out})-1\right)^2\right] \tag{21}$$

A little noise may have a large impact on the results, so a regularization term that maintains the smoothness of the image is required. The TV loss $L_{tv}$ is a common regularization term used in conjunction with other losses to constrain noise by reducing the differences between adjacent pixels:

$$L_{tv} = \frac{1}{HW}\left(\left\|\nabla_x I_{out}\right\|_1 + \left\|\nabla_y I_{out}\right\|_1\right) \tag{22}$$

where $\nabla_x$ and $\nabla_y$ calculate the gradient in the x and y directions, respectively. To get $L_{tv}$, we compute the gradient of each channel of a color image and then average the sum of them. As the training progresses, the total variation in Eq. (22) is minimized and the sensitivity to artifacts is weakened. Moreover, $L_{tv}$ helps smooth the noise generated by the GAN.
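The loss terms above (except the perceptual loss, which needs a pretrained VGG-19 and is omitted here to keep the sketch dependency-free) can be written compactly in PyTorch. The anisotropic formulation of the TV term is an assumption:

```python
import torch
import torch.nn.functional as F

def content_loss(out, gt):
    """L_c: mean L1 distance between output and ground truth."""
    return F.l1_loss(out, gt)

def lsgan_d_loss(d_real, d_fake):
    """Least-squares adversarial loss for the discriminator:
    push real scores toward 1 and fake scores toward 0."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Least-squares adversarial loss for the generator:
    push the discriminator's score on fakes toward 1."""
    return ((d_fake - 1) ** 2).mean()

def tv_loss(img):
    """Total variation: mean absolute difference between adjacent pixels,
    averaged over all channels of the (N, C, H, W) image."""
    dx = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    dy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    return dx + dy
```

A perfectly flat image has zero TV loss, which is why minimizing it smooths away the small pixel-to-pixel oscillations that GAN training tends to introduce.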
Finally, we combine the losses as the generator loss function for training:

$$L_G = L_c + \lambda_1 L_p + \lambda_2 L^G_{adv} + \lambda_3 L_{tv} \tag{23}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting coefficients that balance the loss terms.

III. EXPERIMENTAL RESULTS
A. EXPERIMENTAL SETTINGS
In all experiments, we use the 800 training images of DIV2K, released for the NTIRE2017 [52] challenge on image restoration tasks. The LIVE1 [53] dataset, the Classic5 dataset and the validation set of DIV2K are used for evaluation. For the restoration of the Y channel, all training and evaluation processes are kept consistent for a fair comparison. Since the methods involved in the comparison do not consider color distortion, we design additional comparative experiments to verify the feasibility and effectiveness of WCDGAN. We use the MATLAB JPEG encoder to generate JPEG-compressed images with q = 5, 10, 20. All the approaches including the proposed WCDGAN are implemented with the PyTorch toolbox [54]. We use the Adam optimizer with momentum parameters β1 = 0.5, β2 = 0.999 and initial learning rates of 0.0002 and 0.0001 for the generator and discriminator, respectively. The learning rate is halved after every 20000 iterations, and the batch size is set to 16. We extract 80 × 80 patches with a stride of 75 from image pairs. The Adam optimizer is computationally efficient, has low memory requirements, and is easy to implement; it is well suited to problems with large amounts of data and parameters, and it has been reported to achieve the best performance in optimization [55]. In addition, its hyperparameters usually need little adjustment, which makes the optimizer easy and convenient to use. That is, Adam is suitable for a wide range of nonconvex optimization problems in deep learning [56]. Therefore, we choose the Adam optimizer to optimize WCDGAN. At the beginning of training, the discriminator distinguishes true images from fake ones too easily, so we first train the generator alone for 4 epochs and then alternately update the generator and discriminator one step at a time. Unless otherwise specified, the convolution kernel size is 3 × 3 and the number of convolution kernels is 64.
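The optimizer and learning-rate schedule above translate directly into PyTorch. The tiny single-convolution networks here are stand-ins for the real generator and discriminator:

```python
import torch

# Stand-in modules; the real networks follow the architectures of Figs. 2-3.
g = torch.nn.Conv2d(3, 3, 3, padding=1)
d = torch.nn.Conv2d(3, 1, 3, padding=1)

# Adam with beta1 = 0.5, beta2 = 0.999; lr 2e-4 (generator), 1e-4 (discriminator).
opt_g = torch.optim.Adam(g.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4, betas=(0.5, 0.999))

# Halve the learning rate every 20000 iterations.
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=20000, gamma=0.5)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=20000, gamma=0.5)
```

In the training loop, `sched_g.step()` / `sched_d.step()` would be called once per iteration after the corresponding `optimizer.step()`.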
In the proposed discriminator, whenever the size of the feature map is halved, the number of output channels is doubled.

B. COMPARISON WITH OTHER METHODS ON Y CHANNEL
In the YCbCr color space, the Y channel represents luminance information, which mainly contains textures and details. Therefore, we deliberately remove the influence of color and directly compare the restoration of textures and details. For a fully convincing comparison, we compare WCDGAN with SA-DCT, ARCNN, TNRD, DnCNN, MemNet and MWCNN. We apply PSNR and SSIM [57] as evaluation metrics for quantitative comparison, which are widely used for image quality assessment. As shown in Table 1, WCDGAN outperforms the others in terms of both PSNR and SSIM. Although the gain of WCDGAN over MWCNN is not very high, the improvement cannot be ignored. WCDGAN reconstructs natural-looking images with clear textures, vivid colors, and good perceptual quality from highly compressed images, i.e. quality factor $q \leq 10$. Since the TV loss in WCDGAN causes smoothing effects, we add the $L_1$ loss and the perceptual loss in the loss function to compensate, leading to enhanced detail in the results. As shown in Figs. 7 and 8, the original blocking and banding artifacts are removed cleanly, and the details and textures in the images are restored very well. Regarding details, it can be observed from Fig. 7 that WCDGAN restores more continuous texture details while the other methods do not (see the red boxes). In Fig. 8, it is obvious that in the leaf regions the proposed method achieves the best restoration performance. Compared with the other methods, our results contain clear and distinct lines with no band-shaped or block-shaped artifacts. WCDGAN performs the best on details, whereas the results of the other methods are somewhat rough. From the perspective of overall image quality, our results are also better than the others. Therefore, WCDGAN performs better than the other methods in terms of both local and global visual quality.

C. COMPARISON WITH OTHER METHODS ON RGB IMAGES
The previous experiments mainly focus on the Y channel to show the restoration performance for image textures and details. However, at high compression rates, color distortion is also very serious as shown in Fig. 1. Thus, we evaluate the restoration performance on compressed RGB images, mainly with quality factors 5 and 10. Here, we select LIVE1 and the validation set of DIV2K as the testing datasets. We provide more deblocking results in Figs. 9-12. From the perspective of overall visual perception, MWCNN and WCDGAN perform the best in Fig. 9. Both of them remove blocking and banding artifacts better than the others. However, the enlarged areas reveal that our result produces clearer edges and weaker artifacts around the numbers. Additionally, the result of MWCNN has some slight color distortion compared to WCDGAN.
It can be observed from Fig. 10 that in both the global structure and local details, our results are the best, producing apparent textures and details. Especially in the enlarged areas, our artifacts are the fewest among them. A close look at Figs. 11 and 12 reveals that WCDGAN removes most compression artifacts while successfully restoring textures and details of the images. In the quantitative measurements, it can be observed from Table 2 that the perceptual loss and GAN remarkably increase PSNR values while slightly decreasing SSIM values. Thus, the perceptual loss and GAN could sacrifice part of the structure information to achieve better perceptual quality for the HVS. They focus on the deep features learned by the model and emphasize semantic information, which is quite different from the image structure in the spatial domain. In a word, the GAN structure and perceptual loss help improve the perceptual quality for human eyes. On the whole, WCDGAN achieves good perceptual quality in the results after compression artifact removal.

D. COMPUTATIONAL COMPLEXITY
In Table 3, we provide a comparison of computational complexity in terms of model size and runtime. We compare WCDGAN with state-of-the-art methods: ARCNN, DnCNN, MemNet, and MWCNN. For the tests, we use a PC with an Nvidia GeForce GTX 1080Ti GPU and an Intel i7-7700 3.60GHz CPU. The model size of WCDGAN is smaller than those of MemNet and MWCNN, and the runtime of WCDGAN is faster than theirs. Although the model size and runtime of WCDGAN are not the best, WCDGAN keeps a balance between compression artifact removal and computational complexity. The results indicate that WCDGAN is effective for removing compression artifacts from highly compressed images with low complexity.

E. ABLATION STUDY
We examine the impact of the important modules on performance. First, we evaluate the network performance with and without the mixed convolution; "no mixed convolution" means that we replace the mixed convolution with the standard convolution. As shown in Table 4, when the mixed convolution module is not deployed, on the Classic5 dataset the PSNR value drops by about 0.2dB and the SSIM value by about 0.018, while on the LIVE1 dataset the PSNR value drops by about 0.15dB and the SSIM value by about 0.016. These are average results over three different quality factors. This shows that the mixed convolution plays an important role in the performance of WCDGAN and indicates that a large receptive field can improve network performance. We also explore the effects of the attention mechanism. As shown in Table 4, the PSNR drops by about 0.08dB and the SSIM by about 0.014 on the Classic5 dataset, while the PSNR drops by about 0.09dB and the SSIM by about 0.014 on the LIVE1 dataset. Thus, the mixed attention mechanism has a positive impact on compression artifact removal. These experiments demonstrate that both the mixed convolution and the mixed attention are effective and valuable for compression artifact removal. Furthermore, we perform another ablation study on the perceptual loss and GAN. Table 5 indicates that the perceptual loss and GAN remarkably increase PSNR values while slightly decreasing SSIM values. That is, the perceptual loss and GAN sacrifice part of the structure information in favor of better visual perception for the HVS. In addition, we provide the Inception Score (IS) [58] and Fréchet Inception Distance (FID) [59] in Table 6 to evaluate the realism of the generated images. As shown in the table, the perceptual loss and GAN contribute to the generation of photo-realistic images. Note that a bigger IS and a smaller FID indicate better performance.

TABLE 6. Ablation study on perceptual loss and GAN on the LIVE1 and DIV2K validation datasets in terms of Inception Score (IS) [58] and Fréchet Inception Distance (FID) [59]. The numbers represent IS/FID values, and the bold ones represent the best performance.

We provide a visual ablation study on the perceptual loss and GAN in Fig. 13. It can be observed that the result without the perceptual loss and GAN contains serious artifacts, which severely degrade the visual quality. In contrast, with the perceptual loss and GAN, the artifacts are almost completely removed and clear, natural textures are generated, greatly improving the visual quality.

IV. CONCLUSION
In this paper, we have proposed WCDGAN, a novel network for artifact removal of highly compressed images. It aims to remove compression artifacts and recover high-quality images from highly compressed ones. The mixed convolution enlarges the receptive field of the network while mitigating the grid effect of dilated convolution. The WCDB enhances feature reuse and makes features more expressive. Moreover, the mixed attention module enables WCDGAN to learn informative features. Experimental results demonstrate that WCDGAN reconstructs natural-looking images with clear textures, vivid colors, and good perceptual quality from highly compressed images, even at q = 5. WCDGAN outperforms state-of-the-art models for compression artifact removal in terms of both visual quality and quantitative measurements.
Although WCDGAN has achieved good performance in compression artifact removal, it is limited in providing a lightweight model for practical applications. Therefore, we will investigate a lightweight network for compression artifact removal by replacing the common convolution with depthwise separable convolution.