Attention-Based Dual-Scale CNN In-Loop Filter for Versatile Video Coding

As the upcoming video coding standard, Versatile Video Coding (VVC) achieves up to 30% Bjøntegaard delta bit-rate (BD-rate) reduction compared with High Efficiency Video Coding (H.265/HEVC). To eliminate or alleviate different kinds of compression artifacts such as blocking, ringing, blurring and contouring effects, three in-loop filters, i.e., the de-blocking filter (DBF), sample adaptive offset (SAO) and the adaptive loop filter (ALF), have been adopted in VVC. Recently, the Convolutional Neural Network (CNN) has attracted tremendous attention and shows great potential in many image processing tasks. In this work, we design a CNN-based in-loop filter as an integrated single-model solution that is adaptive to almost any scenario in video coding. An architecture named ADCNN (Attention based Dual-scale CNN), built from attention-based processing blocks, is proposed to reduce the artifacts of I frames and B frames, taking advantage of informative priors such as the quantization parameter (QP) and partitioning information. Different from existing CNN-based filtering methods, which are mainly designed for the luma component and may need to train different models for different QPs, the proposed filter adapts to different QPs and different frame types, and all the components (i.e., both luma and chroma) are processed simultaneously, with feature exchange and fusion between components providing complementary information. Experimental results show that the proposed ADCNN filter achieves 6.54%, 13.27% and 15.72% BD-rate savings for Y, U and V respectively under the all intra configuration, and 2.81%, 7.86% and 8.60% BD-rate savings under the random access configuration. It can replace all the conventional in-loop filters and outperforms them without an increase in encoding time.


I. INTRODUCTION
In recent years, the video business has developed rapidly while the demand for high resolution and high definition has continuously grown. Emerging video applications such as ultra-high-definition (UHD) television, 8K video, panoramic video, Virtual Reality (VR) video, etc., have brought great challenges to video coding and transmission. As the upcoming standard with the most advanced video coding technologies, Versatile Video Coding (VVC) [4] achieves up to 30% bit-rate reduction at the same quality compared with High Efficiency Video Coding (H.265/HEVC) [3]. It is well adapted to videos with high resolutions and different formats, e.g., VR video.
On the other hand, since VVC still employs the block-based coding and quantization structure, various types of distortions such as blocking artifacts, ringing, noise, blurring, etc., still exist. Blocking artifacts appear as noticeable lines along block boundaries caused by discontinuities in pixel values, which significantly affect subjective visual quality. Additionally, due to the loss of high-frequency information, ringing effects appear along sharp textures. Although these distortions are irreversible and cannot be completely eliminated, they can be reduced by specific filters. Therefore, in-loop filtering plays a significant role in restoring these degradations and improving the quality of video frames before they are used as reference frames.
To alleviate or eliminate the coding artifacts in compressed video, in-loop filtering technologies have been adopted in video coding standards. Targeting blocking artifacts, the de-blocking filter (DBF) [5] is designed to smooth the boundary pixels while the inner pixels of the block remain unchanged. Sample Adaptive Offset (SAO) [6] is a filter useful for reducing ringing artifacts through compensation, whereas additional bits are required to signal control flags and offset values. Inspired by Wiener filter theory, the Adaptive Loop Filter (ALF) [7] trains filter coefficients to minimize the distortion between reconstructed frames and original frames, and the coefficients are then transmitted to the decoder. These in-loop filters are stacked to alleviate different kinds of distortions and to improve visual quality while saving coding bit-rate. Note that each of these in-loop filters targets one particular kind of artifact and requires additional bits for coding.
Convolutional Neural Network (CNN)-based methods show great potential in artifact removal and several algorithms have been proposed [9]-[14]. The networks in [9]-[11] are designed for the luma component of JPEG-coded images and train a different model for each quality factor (QF). The methods in [12]-[14] are designed for video coding as in-loop or post-loop filters. However, these methods lack generalization across different situations, such as applicability to all components, to all quantization parameters (QPs), or to different coding configurations.
To this end, we propose an attention based dual-scale CNN in-loop filter, named ADCNN (Attention based Dual-scale CNN), to replace all three filters in the VVC standard. The contributions of this paper are summarized as follows. 1) Different regions in a reconstructed frame have different levels of distortion and need different ranges of compensation. Therefore, an attention mechanism is utilized in our network to adaptively generate the residual image. For boundary regions, we use the partition tree, which indicates the location of blocking artifacts, as the prior attention. For other regions where the compensation is uncertain, we introduce spatial attention and channel attention as self-adaptive attention.
2) The proposed model can process luma and chroma components simultaneously. Since the luma and chroma components have different characteristics in the raw YUV video format, we design a dual-scale model with a high-resolution branch for luma and a low-resolution branch for chroma. Features are exchanged and fused between the two branches, so each component can benefit from the supplementary information of the other components. 3) We use QP as an additional input to make the model, with a single set of parameters, adaptive to multi-quality distortions. With the above design, the proposed model can reduce different kinds of artifacts such as blocking, ringing and fuzzy distortion. As a result, it can replace all the current in-loop filters in VVC.
Moreover, the proposed model can handle not only I frames but also B frames. A Coding Tree Unit (CTU)-level adaptive on/off strategy is used in the proposed work for rate-distortion optimization.
The rest of this paper is organized as follows. Section II provides a review of in-loop filters in video coding standards and related work of CNN-based filters. The proposed network and filtering method are described in Section III. Section IV reports the experimental results. In Section V, we conclude this paper.

II. RELATED WORK
In this section, we briefly review related work on in-loop filters and image quality enhancement methods in image/video compression. First, the filters in current video coding standards are introduced. Then we discuss CNN-based methods in detail.

A. IN-LOOP FILTERS IN VIDEO CODING STANDARD
The widely used video coding standards such as H.264/AVC [2], H.265/HEVC [3] and VVC [4] are jointly developed by the Video Coding Experts Group (VCEG) of ITU-T and the Moving Picture Experts Group (MPEG) of ISO/IEC. Even as coding efficiency improves, various kinds of coding artifacts still exist due to the block-based lossy compression framework. To reduce these degradations in the coding process, the following three in-loop filtering methods have been adopted in video coding standards to enhance the quality of frames for better reference.
Since almost all the operations in the video coding process, such as prediction, transformation and quantization, are conducted at the block level, blocking artifacts often appear along block boundaries. The discontinuity of pixels forms an unpleasant noticeable line, especially at low bit rates. The de-blocking technology [5] first derives the boundary strength according to the change of pixel values and the coding parameters of the blocks on both sides of the boundary, and then applies the corresponding low-pass filtering to the boundaries that need to be filtered. It is a core filter widely used as the first filtering process in almost every block-based coding standard. Although de-blocking filters can reduce blocking artifacts efficiently by smoothing the pixels on the boundary, the inner pixels within the block remain untouched. Other kinds of distortions need to be further reduced by other methods.
SAO [6] is a filter targeting the ringing distortion caused by the loss of high-frequency detail during transform and quantization. The reconstructed pixel values fluctuate up and down around the real pixel values, forming a wavy distortion. The SAO algorithm divides the reconstructed pixels into categories according to their characteristics, and then adds negative offsets to crest pixels and positive offsets to valley pixels for compensation. The offset of each category is calculated at the encoder and transmitted to the decoder side.
SAO is adopted as an additional filter after DBF in H.265/HEVC and VVC. ALF [7] is based on the Wiener filter and aims at minimizing the distortion between the reconstructed frame and the original one. The filter coefficients are trained at the encoder side and transmitted to the decoder, which may cause signaling overhead.
All these in-loop filters are able to suppress specific artifacts in the majority of cases; however, there is still much room for performance improvement, especially at low bit rates. Besides, these filters lack generalization to other kinds of coding artifacts.

B. CNN-BASED FILTERS
Recently, inspired by the great success in high-level vision tasks such as image classification and object detection, CNNs have also been adopted for low-level vision tasks to find a nonlinear mapping from the degraded image to the desired one, such as super-resolution, image restoration, image quality enhancement and image de-noising. Since CNNs have powerful nonlinear fitting ability, they can achieve state-of-the-art performance on these ill-posed image restoration problems.
As a successful case of utilizing a CNN for the super-resolution task, SRCNN [8] shows great performance. Based on SRCNN, Dong et al. proposed an artifacts reduction CNN named ARCNN [9]. It stacked four convolutional layers with the mean squared error (MSE) loss function and boosted the restoration quality of JPEG [1] coded images by more than 1 dB. Svoboda et al. [10] first applied the idea of residual learning to reconstruction and designed a mean squared error loss function based on the Sobel operator, which recovered high-frequency details well for JPEG. A very complex network topology called CAS-CNN is used in [11], which adopted a weighted multi-scale loss generated by multiple outputs and brought an improvement of 1.67 dB in PSNR over JPEG at quality factor (QF) 20. The VRCNN proposed in [12] replaced DBF and SAO in H.265/HEVC, saving 4.6% Bjøntegaard delta bit-rate (BD-rate) in the all intra configuration. STResnet [13] takes the current frame and adjacent frames as inputs, which enables the network to capture not only spatial relations but also temporal relations between frames for joint enhancement. It achieves 1.3% bit-rate savings on average in the random access configuration. The ARTN designed in [14] captures the temporal correlation of consecutive frames by a motion search method, and shows improvements over MPEG-2, Advanced Video Coding (AVC) and HEVC with average Peak Signal to Noise Ratio (PSNR) gains of 1.27, 0.47 and 0.23 dB, respectively. More recently, several works [26]-[29] have aimed at CNN filters in VVC. In [26], an 8-layer network is proposed to enhance the visual quality of decoded I frames, where the Y/U/V components are processed one by one. Inspired by the dense connection design in DenseNet [30], a dense residual structure is used in [27]. Besides, in [28] and [29], lightweight CNN filters are used as additional filters after DBF and before SAO.
However, the applicability of the above methods is limited. Some of them are designed only for the luma component and ignore the complementary information that luma can bring to chroma. Some train different models for different QPs, which is impractical in applications. And most of them can only handle I frames, while the number of B frames is far greater than that of I frames in actual application scenarios.

III. PROPOSED ADCNN MODEL
In this section, we introduce the proposed ADCNN in a top-down manner. The main structure is discussed first; then the basic block with the attention mechanism and the auxiliary inputs of the network are introduced respectively.

A. STRUCTURE OF ADCNN
Unlike the RGB color format, the components in the raw YUV format differ from each other in many aspects. For example, in the widely used YUV 4:2:0 format, luma is twice the size of chroma in both width and height, and the pixel values of the luma component have a much wider dynamic range than those of the chroma components, as shown in Fig. 1. The average reconstruction quality of the three components is also significantly different, which makes filtering difficult. To deal with these problems, we design a two-stage residual network to process all three components simultaneously, as shown in Fig. 2.
In the first stage, a dual-scale pipeline is implemented. The high-resolution branch (i.e., the luma branch) takes the reconstructed luma component as input, and the low-resolution branch (i.e., the chroma branch) takes the two concatenated reconstructed chroma components as input. Each branch is processed by 4 basic blocks, with feature exchange and fusion between the two branches. The feature maps from the luma branch are sent to the chroma branch after a 3×3 convolution (channel=16) with stride=2; meanwhile, the feature maps from the chroma branch are sent to the luma branch after a 3×3 convolution (channel=16) and upsampling. The exchanged feature maps are fused into the corresponding branch by concatenation and a 1×1 convolution (channel=64).
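The stage-1 exchange described above can be sketched as the following dataflow. This is only an illustrative sketch: the stride-2 3×3 convolution and the 1×1 fusion convolutions are replaced by simple stand-ins (subsampling and random channel projections) so the shapes and channel counts from the text can be traced; none of the learned kernels are reproduced.

```python
import numpy as np

def exchange_and_fuse(luma_feat, chroma_feat, rng=None):
    """Sketch of the stage-1 feature exchange: luma features are
    downsampled (a stand-in for the stride-2 3x3 convolution, 16 channels)
    and fused into the chroma branch, while chroma features are upsampled
    by nearest-neighbour repetition and fused into the luma branch.
    The 1x1 fusion convolution is modelled as a random channel projection."""
    rng = rng or np.random.default_rng(0)

    def proj(x, out_ch):  # stand-in for a 1x1 convolution
        w = rng.standard_normal((x.shape[-1], out_ch)).astype(np.float32) * 0.1
        return x @ w

    to_chroma = luma_feat[::2, ::2, :16]                               # downsample to chroma scale
    to_luma = np.repeat(np.repeat(chroma_feat[:, :, :16], 2, 0), 2, 1)  # upsample to luma scale
    luma_out = proj(np.concatenate([luma_feat, to_luma], axis=-1), 64)
    chroma_out = proj(np.concatenate([chroma_feat, to_chroma], axis=-1), 64)
    return luma_out, chroma_out

# toy 16x16 luma features and 8x8 chroma features, 64 channels each
luma, chroma = exchange_and_fuse(
    np.ones((16, 16, 64), np.float32), np.ones((8, 8, 64), np.float32))
```

After fusion, each branch keeps its own resolution and 64 channels, which matches the description of the 1×1 fusion convolution.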
In the second stage, the chroma branch is split into two branches, one for U, and another for V. Each of the three branches is fused with its own Coding Unit map (CUmap) and QPmap which will be introduced in part C and D. Then 4 basic processing blocks are conducted for each branch to generate the final residual image, followed by a global skip connection for the reconstructed image.

B. SELF-ATTENTION BLOCK
The basic processing block in the proposed ADCNN, named the self-attention block, is composed of wide-activated convolutional layers, spatial attention, channel attention and a local skip connection, as shown in Fig. 3.
J. Yu et al. reported that models with wider features before the Rectified Linear Unit (ReLU) [17] activation have significantly better performance for single image super-resolution [15] and restoration [16]. We also use this method for better filtering quality. Since the compensation value varies across regions while the convolution operation performs the same filtering over the whole image, the attention operation covers this shortage by adaptively generating a scale factor at every pixel of the feature maps. In the proposed work, two attention modules are introduced. The spatial attention module reduces the channel size by conducting two convolutional layers followed by a sigmoid activation to generate a spatial-attention map (SAmap) for every spatial pixel. The channel attention module reduces the spatial size by global average pooling (GAP) and conducts two fully-connected layers followed by a sigmoid activation to generate a channel-attention map (CAmap) for every channel. Then the SAmap and the CAmap are point-wisely multiplied with the feature maps.
Given a feature map F ∈ R^(H×W×C′), and setting the number of output channels to C (we set C=32 in the blocks of the luma branch in stages 1 and 2, and C=64 in the other blocks), the detailed processing includes the following steps.
a) The wide-activated convolution is composed of two convolutional layers (kernel size 3×3) with a ReLU between them. The output feature maps of the first convolutional layer have a wider channel number, r times that of the input feature maps, where r denotes the expansion factor and we set r=1.5 in this manuscript. Hence the channel number of F1 is r×C′ according to (1), and the output channel number of the second convolutional layer is reduced to C by (2):

F1 = ReLU(W1 ∗ F), F1 ∈ R^(H×W×rC′), (1)
F2 = W2 ∗ F1, F2 ∈ R^(H×W×C). (2)

b) The spatial attention is composed of two convolutional layers. The channel size shrinks to a quarter of C by the first convolution and to a single channel by the second convolution (kernel size 3×3). A ReLU and a sigmoid activation are applied after the two convolutions respectively, as in (3) and (4):

F3 = ReLU(W3 ∗ F2), (3)
SAmap = σ(W4 ∗ F3), SAmap ∈ R^(H×W×1). (4)

c) The channel attention is conducted following the description in [18], which is known as a method of adaptively weighting each channel. It is able to exploit the complex relationships between different channels and generate a weighting factor for each channel. Each channel is first squeezed to a single value by the GAP operation according to (5); then two fully connected layers, followed by a ReLU and a sigmoid respectively, produce the CAmap according to (6) and (7):

z = GAP(F2), z ∈ R^C, (5)
z′ = ReLU(W5 z), (6)
CAmap = σ(W6 z′), CAmap ∈ R^C. (7)

d) The two generated attention maps are point-wisely multiplied with F2 to scale the feature maps adaptively.
e) Finally, if the number of input channels C′ is the same as the number of output channels C, a skip connection is added from the input directly to the output to learn the residual, which also contributes to fast convergence. Otherwise, there is no skip connection.
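Steps a)-e) can be sketched as the following runnable dataflow. This is only an illustrative sketch of the block: all convolutions are replaced by random channel projections rather than real 3×3 kernels, and the C/4 reduction inside the channel-attention FC layers is an assumption borrowed from the spatial attention (the paper does not state the FC reduction ratio).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv(x, out_ch, rng):
    """Stand-in for a convolution: a random per-pixel channel projection,
    used purely to keep the dataflow sketch runnable."""
    w = rng.standard_normal((x.shape[-1], out_ch)).astype(np.float32) * 0.1
    return x @ w

def self_attention_block(x, out_ch=32, r=1.5, rng=None):
    """Dataflow sketch of the self-attention block (steps a-e)."""
    rng = rng or np.random.default_rng(0)
    in_ch = x.shape[-1]
    # a) wide-activated convolution: expand channels by r, ReLU, reduce to C
    f1 = np.maximum(conv(x, int(r * in_ch), rng), 0.0)
    f2 = conv(f1, out_ch, rng)
    # b) spatial attention: shrink channels to C/4, then to 1, sigmoid
    sa = sigmoid(conv(np.maximum(conv(f2, out_ch // 4, rng), 0.0), 1, rng))
    # c) channel attention: GAP over H,W then two FC layers, sigmoid
    gap = f2.mean(axis=(0, 1))
    ca = sigmoid(conv(np.maximum(conv(gap[None, None], out_ch // 4, rng), 0.0),
                      out_ch, rng))
    # d) scale the features point-wise by both attention maps
    out = f2 * sa * ca
    # e) local skip connection only when channel counts match
    if in_ch == out_ch:
        out = out + x
    return out

y = self_attention_block(np.ones((8, 8, 32), dtype=np.float32), out_ch=32)
```

The SAmap broadcasts over channels and the CAmap broadcasts over spatial positions, so the two attention maps scale each feature map value by a position-dependent and a channel-dependent factor, as described above.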

C. PARTITION TREE FOR PRIOR ATTENTION
Since the blocking artifacts mainly appear along the boundary of CU blocks, we can use the informative partitioning structure to effectively guide the quality enhancement process.
In the H.265/HEVC and VVC video coding standards, a frame is first divided into a sequence of CTUs, and every CTU is further partitioned into several CUs. Different from the single quaternary-tree partitioning in H.265/HEVC, a quad-tree with a nested multi-type tree using binary and ternary splits [4] is adopted as an initial new coding feature of VVC. A CTU is first partitioned by a quaternary tree structure. Then the quaternary tree leaf nodes can be further partitioned by a multi-type tree structure which has four splitting types, as shown in Fig. 4. To feed the CU partitioning information into the network, we construct a feature map named the CUmap, with the positions of the boundaries filled by 1 and other positions by 0.5, as shown in Fig. 5. It indicates the areas and boundaries that require significant compensation and attention. Additionally, it should be noted that there are two partition trees for an I frame, one for the luma component and the other shared by the two chroma components. The CUmap is fused into the corresponding branch in the second stage of the proposed network.
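As a concrete illustration, a CUmap for a toy partition can be built as below. The (top, left, height, width) rectangle list is a hypothetical representation of the decoded CU partition; in the codec this information would come from the partitioning syntax rather than an explicit list.

```python
import numpy as np

def make_cu_map(height, width, cu_rects):
    """Build a CUmap: positions on CU boundaries are set to 1.0 and all
    other positions to 0.5, following the description of Fig. 5.
    `cu_rects` is a hypothetical list of (top, left, h, w) CU rectangles."""
    cu_map = np.full((height, width), 0.5, dtype=np.float32)
    for top, left, h, w in cu_rects:
        bottom = min(top + h - 1, height - 1)
        right = min(left + w - 1, width - 1)
        cu_map[top, left:left + w] = 1.0       # top edge
        cu_map[bottom, left:left + w] = 1.0    # bottom edge
        cu_map[top:top + h, left] = 1.0        # left edge
        cu_map[top:top + h, right] = 1.0       # right edge
    return cu_map

# a toy 8x8 area split vertically into two 8x4 CUs
cu_map = make_cu_map(8, 8, [(0, 0, 8, 4), (0, 4, 8, 4)])
```

The resulting map is 1.0 along the shared vertical boundary (where blocking artifacts concentrate) and 0.5 inside each CU.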

D. GENERALIZATION FOR DIFFERENT QP
Different QPs lead to diverse reconstructed frame qualities. The larger the quantization parameter is, the greater the distortion will be, and the larger the distribution range of the compensation value between the reconstructed pixel and the original pixel will be. For a network with a global residual connection, the output of the network before the input is added back should be as close to the compensation value as possible. If a single set of network parameters is required to adapt to compensation values of different distribution ranges, the network should be given this prior information in order to better filter inputs of different quality. The amplitude of the compensation value is generally larger when a larger QP is applied. For the prior information QP, we construct a feature map named the QPmap, which has the same size as the input component and is filled with the normalized QP value of the current component as in (9):

QPmap = QP / MAXQP, (9)

where MAXQP is set to 63 in VVC. The QPmap is concatenated with the other feature maps in the second stage. Since each feature map of a convolutional layer weights all the feature maps of the previous layer, and a spatial self-attention mechanism is used, the proposed method guides the network to convert different QP values into compensation values of different amplitudes at the corresponding positions through the subsequent convolutions.
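The QPmap of (9) is simply a constant plane at the resolution of the component, which can be formed as follows (the function name is ours, for illustration):

```python
import numpy as np

MAX_QP = 63  # maximum QP value in VVC

def make_qp_map(height, width, qp):
    """Build a QPmap: a constant plane holding the normalized QP value,
    to be concatenated with the stage-2 feature maps (Eq. (9))."""
    return np.full((height, width), qp / MAX_QP, dtype=np.float32)

qp_map = make_qp_map(4, 4, 37)  # e.g. QP 37 -> 37/63 everywhere
```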

E. LOSS FUNCTION
Video coding aims at reducing distortions in decoded frames, and the commonly used quality evaluation criteria, e.g., PSNR and BD-rate, are distance-based. So distance-based loss functions, e.g., the Mean Squared Error (MSE) and the Mean Absolute Error (MAE), are appropriate for filtering, and they bring similar quality gains. In this work, the loss function is chosen as the MAE between the output image of the network Î(Ŷ, Û, V̂) and the ground truth image I(Y, U, V), i.e., formula (10). The reason lies in that, in the training stage, a mini-batch consists of randomly sampled frames that vary in QP. If the MSE loss were used, frames with smaller QPs would contribute a lower proportion of the loss of a mini-batch and might not be adequately trained. The final loss function is set to (11), where integer coefficients are used for simplicity.
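A component-weighted MAE loss in the spirit of (10)-(11) can be sketched as below. The integer weights (4, 1, 1), reflecting the 4:2:0 luma/chroma sample-count ratio, are only an illustrative assumption; the actual coefficients of (11) are not reproduced here.

```python
import numpy as np

def mae_loss(pred, target, weights=(4, 1, 1)):
    """Weighted MAE over the (Y, U, V) components.
    `pred` and `target` are sequences of three arrays (Y, U, V);
    the (4, 1, 1) weighting is an illustrative assumption, not Eq. (11)."""
    return sum(w * np.abs(p - t).mean()
               for w, (p, t) in zip(weights, zip(pred, target)))
```

For a perfect prediction the loss is zero, and each component's contribution scales linearly with its mean absolute compensation error.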

F. CTU ADAPTIVE CONTROL
In video coding, we need to enhance the visual quality while keeping the bit cost as low as possible. Rate-distortion (R-D) performance is considered in almost every procedure in encoding, including in-loop filtering. The R-D cost is calculated as shown in (12):

J = D + λR, (12)

where J is the overall R-D cost, D indicates the distortion of the frame, R denotes the bit cost and λ is the Lagrange multiplier. Note that a CNN works in a data-driven manner and can hardly cover the full range of content characteristics. When the offline-trained model is deployed in the codec, it may not help on image regions unseen in training. Therefore, we adopt CTU-level on/off control to avoid a decrease in R-D performance. When the quality improvement is not worth the signaled bits, the frame-level filtering is turned off to avoid over-signaling. Specifically, the control flags at the CTU level and the frame level are designed as follows. For each CTU, if the filtered CTU achieves a lower R-D cost, the corresponding CTU control flag is enabled; otherwise the flag is disabled. After all the CTUs in one frame are determined, the frame-level R-D costs before and after filtering are calculated by (12), denoted by J1 and J2 respectively. If J1 > J2, the frame-level flag is enabled; hence the frame-level flag is encoded in the slice header and the CTU-level control flags are signaled in the corresponding CTU syntax. Otherwise, the frame-level flag is disabled and the CTU-level flags are not encoded for transmission.
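The on/off decision based on (12) can be sketched as follows. The per-CTU distortions and the one-bit flag cost are hypothetical placeholders for the codec's actual measurements; the sketch only shows the decision logic.

```python
def rd_cost(distortion, bits, lam):
    """Eq. (12): J = D + lambda * R."""
    return distortion + lam * bits

def decide_flags(ctu_stats, lam, flag_bits=1):
    """Sketch of the CTU/frame on/off decision.
    `ctu_stats` is a hypothetical list of (D_filtered, D_unfiltered)
    distortions per CTU; each CTU flag costs `flag_bits` when signalled."""
    ctu_flags = []
    j_filtered, j_unfiltered = 0.0, 0.0
    for d_filt, d_unfilt in ctu_stats:
        # per-CTU decision: enable the flag when filtering lowers the cost
        use = rd_cost(d_filt, flag_bits, lam) < rd_cost(d_unfilt, flag_bits, lam)
        ctu_flags.append(use)
        j_filtered += rd_cost(d_filt if use else d_unfilt, flag_bits, lam)
        j_unfiltered += d_unfilt  # filtering off: no flags are signalled
    # frame-level decision: signal flags only if filtering pays off overall
    frame_flag = j_filtered < j_unfiltered
    return frame_flag, ctu_flags if frame_flag else []

frame_flag, flags = decide_flags([(10.0, 20.0), (30.0, 25.0)], lam=1.0)
```

In this toy example the first CTU benefits from filtering while the second does not, and the frame-level flag is enabled because the overall cost with signalling is still lower.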

IV. EXPERIMENTAL RESULTS

A. DATASET PREPARATION FOR I FRAME
Since I frames are generally encoded independently, large lossless picture datasets are used instead of video datasets to provide extensive texture information. The DIV2K dataset [19], which consists of 900 2K-resolution PNG pictures (800 images for training and 100 images for validation), is employed to derive training and validation data through the VVC reference software VTM-4.0 [22] under the All Intra configuration using QP values of {22, 27, 32, 37}. We first convert the pictures from the PNG format to the YUV 4:2:0 format, and then collect the distorted image before in-loop filtering as the input and the original image as the ground truth.

B. DATASET PREPARATION FOR B FRAME
YUV sequences from Xiph.org [20] and the SJTU 4K video sequence dataset [21] are used for training the network for B frames. In total, 90 sequences are used, with 70 for training and 20 for validation, randomly selected in each category. These sequences are encoded by the VTM-4.0 software under the Random Access configuration using base QP values in {22, 27, 32, 37} to collect B frames of different qualities. In the Random Access configuration, the actual QP (called the slice QP) of each frame varies with the reference level, as shown in Table 1.

C. TRAINING SETTINGS
We implement the proposed model using TensorFlow [23]. During training, we randomly select frames from the dataset and crop them into 48×48 patches. We use a batch size of 128, where half of the samples are I frames and the other half are B frames. Training starts with a learning rate of 1e-3, which is decayed by a factor of 2 every 100 epochs.
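The stated schedule (start at 1e-3, halve every 100 epochs) corresponds to a simple step decay:

```python
def learning_rate(epoch, base_lr=1e-3, decay_every=100):
    """Step decay used in training: start at base_lr and halve
    ("decayed by a factor of 2") every `decay_every` epochs."""
    return base_lr * 0.5 ** (epoch // decay_every)
```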

D. COMPLEXITY ANALYSIS
We analyze the number of parameters and floating-point operations (FLOPs) of the basic block and of the entire model, as shown in Table 2. Both weights and biases are counted, and both multiply and add operations are reported in the table. The model size is 9.11 MB.

E. ABLATION ANALYSIS
To evaluate the effectiveness of the architecture and modules in the proposed network, we conduct a series of ablations, as given in Table 3, showing the ∆PSNR between the output frames and the unfiltered frames on the validation datasets of I and B frames.
In test 1, we remove the feature exchange between the luma branch and the chroma branch, and the result shows that the ∆PSNR of all three components declines, with a steep drop on chroma. This can be explained by the fact that the valid high-frequency information of chroma is poorer than that of luma. Therefore, the feature exchange and fusion strategy designed in the proposed work effectively boosts the filtering quality, especially for the chroma components.
Additionally, the other tests shown in Table 3 demonstrate the significant roles of the different mechanisms, including the wide-activated convolution (wconv), the CAmap, the SAmap, the QPmap and the CUmap, respectively. In Fig. 6, we show the convergence curves of the ablation experiments during training. It can be observed that the complete ADCNN converges the fastest and most steadily, while the ADCNN without the QPmap shows instability in the training process.
To further demonstrate the effectiveness of the QPmap, we design another experiment (shown in Table 4) to compare the generalization capability over a wide range of QPs with and without the QPmap. In both experiments, we use only one QP band (QP ∈ [28,36]) of the training dataset to train the model, and test it on three QP bands (QP ∈ [19,27], QP ∈ [28,36], QP ∈ [37,45]). From the results we can observe that the model with the QPmap adapts well to other QP values even though it was not trained on such samples. While the ∆PSNR of the model without the QPmap declines on the lower and higher QP bands, the model with the QPmap achieves better performance, especially on the lower QP band.

F. VISUALIZATION OF SAMAP
To visualize the effect of the spatial attention mechanism, the 18th frame of the sequence "tempete" is shown in Fig. 7 as an example. The ground truth frame, the SAmap in the 1st attention block of the second stage, the ground truth residual frame (i.e., the difference between the ground truth frame and the unfiltered frame), and the output residual frame generated by ADCNN are visualized respectively. It can be noticed that the scaling factors in the SAmap coincide well with the distribution of the real compensation frame, leading to local adaptability and good filtering quality.

G. COMPARISON WITH VVC
We integrate the proposed ADCNN model into the VVC reference software VTM-4.0 [22] using TensorFlow [23] frozen protocol buffer files and the TensorFlow C++ API. Specifically, when the coding process of a frame is finished, we turn off all the in-loop filters (i.e., DBF, SAO and ALF) and apply our ADCNN to all the components of the reconstructed frame instead.
The experiments are conducted under the VVC common test condition (CTC) [24] with the All Intra (AI) and Random Access (RA) configurations respectively, with VTM-4.0 as the anchor. The CTC test sequences [24] are used for the AI configuration, and Class B to Class D are used for the RA configuration. The detailed hardware and software of the test environment are shown in Table 5. Table 6 and Table 7 compare the BD-rate [25] savings and the codec running time ratios between the proposed method and the VVC reference software VTM-4.0 under the AI and RA configurations, using a CPU and a GPU respectively. The codec running time ratios of the encoder and decoder are indicated by EncT and DecT in the tables, respectively. We also report the number of test frames and the selected ratio of the CTU-level control flags. Moreover, by subjective observation, there are no significant blocking artifacts between selected and unselected CTUs. It is observed that 6.54%, 13.27% and 15.72% BD-rate savings are achieved for AI, and 2.81%, 7.86% and 8.60% for RA. The performance of the proposed model is significant, especially on the two chroma components, and with GPU acceleration there is no additional encoding time cost.
For subjective evaluation, the 1st frame of the sequence "FourPeople" is shown in Fig. 8 as an example I frame, and the 18th frame of the sequence "BlowingBubbles" is shown in Fig. 9 as an example B frame. The frames before in-loop filtering are also shown for comparison. Apparently, the proposed network can efficiently remove different kinds of artifacts and outperforms the conventional filters of the VVC standard in both objective metrics and subjective quality. The frame filtered by the proposed ADCNN has a higher PSNR, and meanwhile, besides the reduction of blocking artifacts, the ringing and contouring effects along the edges are visibly alleviated.

H. COMPARISON WITH OTHER METHODS
Table 8 and Table 9 compare the BD-rate savings and codec complexity of different methods over the VVC reference software VTM-4.0 under the AI and RA configurations respectively. The complexity is measured by the time ratio between the compared method and the anchor VTM-4.0 using a CPU. Three NN-based in-loop filtering methods are used for comparison. It should be noticed that the compared filters are all hybrid, where the corresponding CNN-based filter is used as an additional filter on top of the conventional ones. In contrast, the proposed ADCNN completely replaces the current filters, i.e., DBF, SAO and ALF, while still outperforming all the compared methods. Although NN-based filters commonly have higher codec complexity, the proposed method achieves better performance.

V. CONCLUSION
In this work, a CNN-based in-loop filter is proposed as an integrated solution to replace all the conventional filters in video coding. The proposed filter is able to handle various kinds of coding distortions. The major contribution of this paper lies in a solution covering a wide range of QPs with a single model, taking advantage of the prior QP information and attention-based basic blocks. The self-attention blocks can generate higher scaling ratios for distorted regions and also learn to convert the input QP values into larger positive or negative residual values at the proper locations. Besides, the model adopts a dual-scale architecture with feature exchange and fusion between the luma and chroma branches to achieve significant filtering quality for both chroma components. Moreover, we provide an ablation analysis to prove the effectiveness of the attention based dual-scale architecture and the prior information. The model is adaptive to different degradation levels and to I/B frames. Experimental results demonstrate that the proposed ADCNN leads to significant BD-rate savings on all three components compared to VVC, and outperforms state-of-the-art hybrid solutions using CNN-based filtering in VVC. For the next step, we will focus on model simplification (quantization, pruning and layer reduction) and acceleration of the computation.