Improving Image Compression With Adjacent Attention and Refinement Block

Recently, learned image compression algorithms have shown remarkable performance compared to classic hand-crafted image codecs. Despite these considerable achievements, their fundamental disadvantage is that they are not optimized to retain local redundancies, particularly non-repetitive patterns, which has a detrimental influence on reconstruction quality. This paper introduces an efficient autoencoder-style network-based image compression method that contains three novel blocks, i.e., an adjacent attention block, a Gaussian merge block, and a decoded image refinement block, to improve the overall image compression performance. The adjacent attention block allocates the additional bits required to capture spatial correlations (both vertical and horizontal) and effectively removes worthless information. The Gaussian merge block assists the rate-distortion optimization performance, while the decoded image refinement block repairs the defects in low-resolution reconstructed images. A comprehensive ablation study analyzes and evaluates the qualitative and quantitative capabilities of the proposed model. Experimental results on two publicly available datasets reveal that our method outperforms the state-of-the-art methods on the KODAK dataset (by around 4 dB and 5 dB) and the CLIC dataset (by about 4 dB and 3 dB) in terms of PSNR and MS-SSIM.


I. INTRODUCTION
Image compression reduces spatial redundancy in images and optimizes bandwidth and storage space in various applications, including video compression, online advertising, professional photographic exchange, etc. Traditional image compression algorithms [1]-[4] depend on hand-crafted processes with intricate dependencies to increase compression efficiency. For example, JPEG [1] employs the discrete cosine transform (DCT), while JPEG2000 [2] uses the discrete wavelet transform (DWT) to transform image pixels into the frequency domain and decompose the image into multi-scale spectral bands. However, both cause artifacts along block borders that are only invisible at high bit rates. Recent video codecs, such as VVC [3], incorporate intra prediction and an in-loop filter for intra-frame coding. Intra prediction is also utilized in BPG [4], an image codec, to minimize redundant and irrelevant features and improve the quality of the reconstructed frame. However, traditional compression techniques cannot be optimized end-to-end, limiting their overall rate-distortion (RD) optimization performance (particularly in similarity indices) and learning ability.
Nowadays, deep learning-based image compression methods [5]-[10] outperform traditional algorithms in terms of rate-distortion (RD) performance. For example, Ballé et al. [5] proposed an end-to-end image compression method using a convolutional neural network (CNN) based autoencoder. In particular, context-adaptive entropy models for learned image compression are renowned for achieving higher performance than traditional codecs. The study [6] introduced a hyperprior that adds more bits to the entropy model to describe the latents more accurately. Minnen et al. [8] used auto-regressive prior information to build an accurate entropy model and achieve compression efficiency equivalent to or even higher than the conventional codec [4]. The work in [10] introduced a very similar notion by taking into account two sorts of contexts, bit-consuming contexts (that is, the hyperprior) and bit-free contexts (that is, the auto-regressive model), achieving a context-adaptive entropy model. Although these methods enhance the compression performance, they also greatly increase the compression artifacts [11] due to the quantization process during entropy coding, and they are constrained by limited receptive fields in the latent space.
To boost the overall image compression performance, attention mechanisms have been utilized to gather more details from the latent space while suppressing irrelevant information when allocating bits [12]-[14]. The non-local attention mechanism [15] is effective in many visual tasks (e.g., semantic segmentation). Liu et al. [12] use non-local attention to build implicit importance masks for guiding the adaptive processing of latent features. On the other hand, Cheng et al. [13] remove the non-local block to make learned image compression easier to train. The most recent research in [14] also employed non-local attention to enhance the adaptive processing of latent features. It helps the compression algorithm allocate additional bits to complicated areas (e.g., edges and textures). However, this work suffers from some drawbacks. Firstly, its non-local attention (working in a single direction) does not weight the vertical and horizontal directions separately to produce a wide receptive field and acquire valuable features for improving RD performance. Secondly, a single mask in entropy coding cannot eliminate the redundancy of latent feature data. Thirdly, compression artifacts are dramatically increased because bits are assigned to non-essential areas, resulting in poorly reconstructed images. Motivated by this, we propose an efficient end-to-end image compression method that significantly improves the overall RD performance. Our contributions in this paper are summarized as follows:
• We present an end-to-end autoencoder-based image compression model to improve the overall image compression performance. Three new blocks, i.e., an adjacent attention block (AAB), a Gaussian merge block (GMB), and a decoded image refinement block (DIRB), are included in this model.
• A plug-and-play AAB is applied to capture spatial correlations (both vertically and horizontally), suppress unnecessary information, and boost entropy-coding efficiency by allocating additional bits to the most crucial features.
• The GMB simulates the distribution of the latent representation in a precise manner to boost the rate-distortion optimization performance.
• Compression artifacts are inevitable in the final reconstructed images since our approach is a lossy image compression method. A DIRB is used to leverage global information with rich texture information and vibrant features to improve the reconstructed image quality.
• An extensive experiment is conducted on two publicly available datasets. Our method shows state-of-the-art performance on both datasets while simultaneously reducing computational complexity.
The remainder of the paper is arranged in the following manner. In Section II, traditional and existing deep learning-based works are reviewed. The proposed architecture for image compression with the three new blocks, i.e., AAB, GMB, and DIRB, is described in detail in Section III. Section IV presents the dataset, training details, and evaluation metrics. The qualitative and quantitative results with some ablation studies are presented in Section V. Finally, Section VI concludes the paper and outlines our future research.

II. RELATED WORKS
In this section, we briefly discuss the classical and deep learning-based image compression methods.

A. CLASSICAL METHODS
Image compression techniques are primarily concerned with reducing the spatial redundancies present in images. For example, images converted from the pixel domain to the frequency domain are simpler to compress. For instance, JPEG [1] applies the discrete cosine transform, whereas JPEG2000 [2] applies the hand-crafted discrete wavelet transform. To reduce data redundancy, high-frequency information is separated from low-frequency information, and bits are allocated according to signal significance. Entropy coding such as Huffman coding [16], [17], hashing [18], and arithmetic coding [19], [20] is also utilized to increase the lossless compression performance.
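As an illustration of the transform step, an unnormalized type-II 2-D DCT (the transform JPEG applies to 8×8 blocks) can be written out directly; real codecs use fast factorized variants and add normalization and quantization tables, so this is only a sketch:

```python
import numpy as np

def dct2(block):
    """Direct (unnormalized) type-II 2-D DCT of a square block.

    basis[k, x] = cos(pi * (2x + 1) * k / (2n)); applying it on both
    sides gives the separable 2-D transform."""
    n = block.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    return basis @ block @ basis.T

# A flat block has all its energy in the DC coefficient,
# which is why smooth regions compress so well.
block = np.full((8, 8), 10.0)
coeffs = dct2(block)
```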
Currently, the intra-prediction approach [3], [4], which is often used in video compression, has been employed for image compression as well. The BPG [4] standard, for example, is based on the intra coding of HEVC/H.265 [21] and delivers better image compression results than prior methods such as JPEG and JPEG2000. The prediction-transform approach is used in the BPG standard [4], and 35 intra-prediction modes are utilized to create the reconstructed image, which also decreases redundant data. Beyond that, larger coding units, more prediction modes, more transform types, and more coding tools are all supported by VVC [3]. Furthermore, hybrid techniques employ both conventional compression techniques and recent learning-based super-resolution strategies, such as [22], to achieve higher compression ratios. However, traditional algorithms are built from hand-crafted components (such as entropy coding).

B. DEEP LEARNING-BASED METHODS
Deep neural networks (DNNs) have proven useful for various computer vision applications in recent years, namely super-resolution, denoising, and object recognition. Some recent studies have attempted to exploit the excellent representation capabilities of neural networks to improve the performance of image compression [5], [6], [8], [10], [13], [23]-[32]. Toderici et al. [23] developed the first learning-based image compression framework, which was based on a recurrent neural network (RNN); various bitrates may be generated using a single model in their method. Compared to BPG, [28] introduces more complex RNN components and efficient reconstruction approaches to obtain equivalent or even superior MS-SSIM [33] results. Although some of these approaches [23], [25], [28] aim to reduce the bitrate, the rate-distortion (RD) trade-off is not considered.
To improve the RD performance, Ballé et al. [24] introduced a CNN-based framework with the generalized divisive normalization (GDN) layer, which is effective for simulating nonlinear transformations and has been frequently employed in subsequent approaches [5], [6], [8], [10], [13], [14], [34]. However, these methods adopt a single Gaussian Model (GM) distribution that still falls short of encoding latent features by effectively estimating their conditional statistics. Rippel and Bourdev [35] introduced a feature pyramid network (FPN) to obtain more valuable features. However, this also leads to redundant information since the convolutional layers share features. Li et al. [29] suggested the use of a significance map to alter the bit allocation of images, which they found to be effective. To create the significance map, a branch consisting of a three-layer convolutional neural network was trained. However, explicitly learning the significance map requires additional weights, which raises the computational cost. It is also tricky to adaptively assign bits to in-depth features, as described in [29].
In the training process, some methods [27], [32] employed a generative adversarial network (GAN) as a distortion measure to lead the decoder to create more plausible pattern structures, which tends to result in reconstructed images of decent visual quality. But the pattern structures obtained in this way are not actual textures and lack fidelity. Recent studies on adaptive learning of feature significance have shown that attention strategies are quite effective. Considerable progress has been achieved in areas such as natural language processing [36] and semantic segmentation [15]. Moreover, the efficiency of noise removal and super-resolution can be dramatically improved by incorporating non-local blocks (NLB) into neural networks [37], [38]. In image compression, some methods [12]-[14], [39] employ attention mechanisms that allow spatially adaptive feature responses for more difficult locations (i.e., patterns, saliency) in order to allocate more bits. For example, [39] introduced an improvement unit that operates on full-resolution images to eliminate compression artifacts by filtering the reconstructed images with a simple neural network. [12]-[14] employed residual non-local attention mechanisms to improve the RD performance and reduce the compression artifacts caused by the quantization procedure. However, these attention mechanisms cannot exploit features in both directions (vertical and horizontal) because of their one-way weight allocation. Therefore, allocating more bits to complex regions (i.e., patterns, edges) is not fully explored to improve the final reconstructed image.
In contrast, we propose an adjacent attention block that uses distinct weights in the horizontal and vertical directions for feature maps to maintain only the most relevant information while eliminating unnecessary information, such as a complicated natural background, which has a significant impact on the performance of RD. Furthermore, in order to decrease compression artifacts, we have included a refinement block, which is capable of smoothing out and improving the visualization of the reconstructed image.

III. METHODOLOGY
This section presents the proposed deep image compression framework in detail; the architecture is shown in Figure 1. Typically, well-known autoencoders are used in CNN-based compression techniques [5], [6], [8], [12], [29], [30], [32], [35]. Among them, the variational autoencoder (VAE) has been shown to be a successful architecture for compression, as first described in [6]. In this network [6], to capture spatial relationships successfully while boosting the compression performance of the entropy model efficiently, hyper-encoder and hyper-decoder networks are employed with quantization applied twice. Therefore, motivated by [6], we adopt an autoencoder-type network for learning-based image compression with three new blocks to improve the overall performance. In particular, four modules are employed in the proposed system: the main encoder and decoder, as well as the hyper-encoder and hyper-decoder networks. The proposed attention mechanism, referred to as the adjacent attention block (AAB), is included in each architecture module. Two additional blocks, the Gaussian merge block (GMB) and the decoded image refinement block (DIRB), are introduced to increase the overall RD performance and improve the reconstructed image, respectively.
At first, the original image I is passed through the main encoder network, which creates the corresponding latent representation l_a by employing four convolutional layers with non-linear functions (e.g., GDN). After that, l_a is quantized to l̂_a. The quantized latents l̂_a are delivered to the decoder network to generate the final reconstructed image Î after arithmetic encoding (AE) and decoding (AD) [19]. We utilize the same quantization method as [6], [8], with some modifications in the latent stage (i.e., the added GMB block) in a precise way. When it comes to image compression, the goal is to obtain high-quality reconstructed images at a certain bitrate, and the entropy model is utilized to predict the bitrate target. The entropy model uses the hyperprior module in conjunction with the factorized module. This method of entropy coding uses a hyperprior network to produce an estimate of the latents before quantizing and encoding the output of the hyperprior encoder into the bitstream. This information is encoded into the bitstream since it is necessary for decoding, and a proper entropy model increases compression effectiveness. In this work, the hyper-encoder module receives the hyper-prior information from the latent representation and encodes it into the latent representation l_b. After that, l_b is quantized to l̂_b and passed to the hyper-decoder after the AE and AD processes. The hyper-decoder module retrieves the hyper-prior information from l̂_b and estimates the relevant entropy model parameters (ϕ, ϑ) accordingly. In the following three subsections, we describe our three proposed blocks, i.e., the AAB, GMB, and DIRB.
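The quantization step can be sketched as follows; this is the standard scheme from the hyperprior literature (additive uniform noise as a differentiable proxy during training, hard rounding at inference), given as an assumption rather than the paper's exact implementation:

```python
import torch

def quantize(latent, training):
    """Quantization for hyperprior-style codecs.

    During training, hard rounding has zero gradient almost everywhere,
    so additive uniform noise U(-0.5, 0.5) is used as a differentiable
    proxy; at inference the latents are actually rounded."""
    if training:
        return latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
    return torch.round(latent)

q = quantize(torch.tensor([1.2, -0.7, 0.4]), training=False)
```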
The loss function (ϒ) below is employed to optimize the whole training process of the compression technique:

ϒ = λ · D + R = λ · d(I, Î) + H(l̂_a) + H(l̂_b)     (1)

where D and R are the distortion and bitrate, respectively, and λ balances the amount of distortion against the bit rate. The distortion measure (MS-SSIM [33]) is denoted by d(·), and H denotes the bitrate used for encoding the latent representations l̂_a and l̂_b, respectively.
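A minimal PyTorch sketch of this rate-distortion objective, assuming the bit costs of the two latents have already been estimated as −log₂ likelihoods; MSE stands in here for the paper's MS-SSIM distortion d(·):

```python
import torch

def rate_distortion_loss(x, x_hat, bits_latent, bits_hyper, lam, num_pixels):
    """Rate-distortion objective: lambda * D + R (rate in bits per pixel).

    bits_latent / bits_hyper are the estimated code lengths of the
    quantized latents l_a and l_b, i.e. -log2 of their likelihoods."""
    bpp = (bits_latent + bits_hyper) / num_pixels   # rate term R
    distortion = torch.mean((x - x_hat) ** 2)       # distortion term D (MSE)
    return lam * distortion + bpp

# toy usage: likelihoods -> code length in bits
likelihoods = torch.tensor([0.5, 0.25, 0.25])
bits = -torch.log2(likelihoods).sum()               # 1 + 2 + 2 = 5 bits
loss = rate_distortion_loss(torch.zeros(4), torch.zeros(4), bits,
                            torch.tensor(0.0), lam=0.01, num_pixels=4)
```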
During the training phase, we use the entropy estimation method presented in [8] and represent the latent features in the following way:

p_{l̂_a | l̂_b}(l̂_a | l̂_b) = ∏_i ( N(ϕ_i, ϑ_i²) * U(−1/2, 1/2) )(l̂_a^(i))

Every latent representation l̂_a^(i) is modeled as a Gaussian distribution whose parameters ϕ_i and ϑ_i are predicted from the hidden element l̂_b. Here, l̂_b is referred to as the hyperprior, U stands for a uniform distribution, and * is the convolution operation. The hyperprior l̂_b is represented as:

p_{l̂_b | ψ}(l̂_b | ψ) = ∏_i ( p_{l̂_b^(i) | ψ^(i)} * U(−1/2, 1/2) )(l̂_b^(i))

where each univariate distribution is represented by p_{l̂_b^(i) | ψ^(i)} with parameters ψ^(i). The bit rate in our technique is made up of the bit rates for the hidden variable l̂_b and the latent representation l̂_a; the bit-rate terms of Equation (1) are:

H(l̂_a) = E[−log₂ p_{l̂_a | l̂_b}(l̂_a | l̂_b)],   H(l̂_b) = E[−log₂ p_{l̂_b | ψ}(l̂_b | ψ)]

A. ADJACENT ATTENTION BLOCK
In deep neural networks, the attention mechanism is an effort to emulate the behavior of deliberately focusing on a few significant elements while disregarding the rest. There are currently three primary techniques for including attention mechanisms: spatial attention [40], channel attention [41], and the Convolutional Block Attention Module (CBAM) [42]. Meanwhile, several researchers have adapted spatial attention based on non-local blocks [43] to image compression [12], [14], intending to reduce spatial redundancy. Furthermore, to construct an image generation model, [44] employed a transformer-based self-attention block to generate larger images. However, these methods concentrate only on building deep networks to increase the models' representation capability, which results in high computation and memory demands. Besides, in most cases, the conventional spatial attention mechanism [45] only provides one-direction weight allocation [12]-[14], which results in the loss of vital information up to a certain level. We propose a spatial adjacent attention mechanism, namely the AAB, which allocates weight coefficients with distinct methods in both the vertical and horizontal directions. In addition to successfully suppressing irrelevant information, it also ensures that the loss of critical information is kept to an absolute minimum. Besides, it concentrates on high-contrast textures at the edges of the image and allocates additional bits to them. Figure 2 depicts the proposed structure of the AAB, which consists of three parts.
• First, the coefficients of weight features are selected by the vertical weight features (VWF) and horizontal weight features (HWF) blocks. It works crosswise to obtain more stable features for allocating more bits in edge areas.
• Second, the two types of weight features are multiplied through the structure's weight multiplication (WM) module to combine the weight coefficients (for example, a small weight could be 0.1 × 0.2, while the highest weight could be 0.9 × 0.7).
• Third, the softmax function normalizes the weight coefficients extended by the weight multiplication block, and the maximum weight (MW) block selects the most significant weight coefficients [for instance, max(0.1, 0.9)]. To connect and concatenate the weight coefficients, the three parts of the model are arranged as follows:

w_i = Σ_l ( a_{i,l} / Σ_{q=1}^{n} a_{i,q} ) · d_l ,   w = m(w_s, w_r)     (2)

The weights (w_i) of VWF and HWF allocated by the attention process are denoted by a_{i,l}; pixel i and I denote the feature at a specific instant and the sequential feature, and the hidden-layer characteristics of the feature sequence I are indicated by d_l. In Equation (2), m indicates the weight multiplication, and w_s represents the weight coefficient of VWF in the feature space (w_s = [w_1, w_2, . . . , w_{i−1}, w_i]). The weight operations of WM and MW are then denoted by (w_s * w_r) and max(w_s, w_r), respectively. After completing all the weight operations of VWF and HWF, one convolutional layer (Conv) and an average pooling (AP) layer produce the deep feature. According to Figure 1, for high-quality compression, the suggested AAB is incorporated into the encoding, decoding, hyper-encoding, and hyper-decoding networks to leverage the channel relationship. The re-weighted feature map from the AAB is fed into the subsequent quantization and entropy coding components.

FIGURE 3. Architecture of the GMB. For each layer, N specifies the hyper-parameter that determines how many channels are available, and C indicates how many different Gaussian models are available.
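The data flow described above can be sketched as a hypothetical PyTorch module: direction-specific weight features via asymmetric kernels, weight multiplication (WM), maximum-weight selection (MW), and a Conv + average-pool fusion. The kernel sizes and layer widths are our assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentAttentionBlock(nn.Module):
    """Hypothetical AAB sketch: vertical/horizontal weight features,
    WM (element-wise product), MW (element-wise maximum), then a
    convolution and average pooling to produce the re-weighted map."""
    def __init__(self, channels):
        super().__init__()
        # asymmetric kernels stand in for the VWF / HWF branches
        self.vwf = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.hwf = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        w_v = torch.sigmoid(self.vwf(x))          # vertical weights in (0, 1)
        w_h = torch.sigmoid(self.hwf(x))          # horizontal weights
        wm = w_v * w_h                            # weight multiplication (WM)
        mw = torch.maximum(w_v, w_h)              # maximum weight (MW)
        # softmax-normalize the multiplied weights over spatial positions
        attn = torch.softmax(wm.flatten(2), dim=-1).view_as(wm) * mw
        feat = F.avg_pool2d(self.fuse(x * attn), 3, stride=1, padding=1)
        return x + feat                           # residual re-weighting

x = torch.randn(1, 8, 16, 16)
y = AdjacentAttentionBlock(8)(x)
```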

B. GAUSSIAN MERGE BLOCK
Estimating bit rates is critical in learning-based image compression techniques. Minnen et al. [8] and Lee et al. [10] demonstrate learning-based systems in which the hyper-prior compression technique is employed and a Gaussian Model (GM) distribution is used to represent the latent representation l_a in the model:
p_{l̂_a | l̂_b}(l̂_a | l̂_b) = N(ϕ, ϑ²),  with (ϕ, ϑ) = E_{l̂_b}(l̂_b)

where E_{l̂_b}(l̂_b) denotes the quantized entropy model [5]. The purpose of the hyper-encoder and hyper-decoder is to predict the parameters (ϕ, ϑ) of the GM. Though the single GM-based entropy model is a significant improvement over prior work [5], the representation capability of a single GM is still inadequate, particularly for complicated components. As a result, we introduce the Gaussian Merge Block (GMB) to boost the image compression performance. In our proposed GMB, l̂_a is expressed as:

p(l̂_a) = Σ_{i=1}^{G} W_i · N(ϕ_i, ϑ_i²)

where W_i and G denote the weights assigned to the various GMs and the number of GMs, respectively. To estimate the parameters (ψ) of the GMB, we use three convolutional layers with three LeakyReLU layers, as illustrated in Figure 3. In our proposed GMB, the value of G is set to three. A total of 6 × N output channels are employed, with the first 5 × N channels being used for predicting the means and variances of the three GMs. A sigmoid layer is included in the output of the final N channels to estimate the weight of every GM; for example, the weight of the first GM is denoted by W, and the next one is (1 − W). Furthermore, by creating G (G ≥ 4) GMs, we may increase the number of output channels of the GMB block to 4 × G × N (C = 4 × G). For G GMs, the mean and variance parameters are estimated by the first 3 × G × N channels in the same manner. The softmax layer is utilized after the final G × N channels to determine each GM's weight.
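The mixture likelihood at the heart of the GMB can be sketched as follows, using the usual ±0.5 quantization-bin integration from hyperprior models; in the real network the per-element means, scales, and softmax-normalized weights would come from the convolutional layers described above:

```python
import torch

def gaussian_mixture_likelihood(y_hat, means, scales, weights):
    """Likelihood of quantized latents under a G-component Gaussian
    mixture, integrated over the (y-0.5, y+0.5) quantization bin.

    means / scales / weights: (..., G) per-element mixture parameters,
    with weights summing to 1 over the last axis."""
    y = y_hat.unsqueeze(-1)                       # broadcast over components
    dist = torch.distributions.Normal(means, scales)
    # P(y - 0.5 < Y < y + 0.5) per component, then mixture-weighted sum
    per_comp = dist.cdf(y + 0.5) - dist.cdf(y - 0.5)
    return (weights * per_comp).sum(dim=-1)

G = 3                                             # as in the proposed GMB
y_hat = torch.zeros(2, 3)                         # toy quantized latents
means = torch.zeros(2, 3, G)
scales = torch.ones(2, 3, G)
weights = torch.full((2, 3, G), 1.0 / G)
p = gaussian_mixture_likelihood(y_hat, means, scales, weights)
bits = -torch.log2(p).sum()                       # estimated code length
```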

C. DECODED IMAGE REFINEMENT BLOCK
The proposed compression approach employs a quantization procedure for the entropy model. As a result, compression artifacts may appear in the reconstructed image. Thus, the proposed DIRB is appended after image reconstruction on the decoder side, which significantly improves the quality of the decoded image. To improve the representation of feature maps, the proposed refinement block uses a self-similarity measure and inter-spatial relationship information. The underlying non-local operation is:

O_i = (1 / f(O)) Σ_{∀j} F(O_i, O_j) · γ(I_j)

where i is the position index of the feature response to be computed, and j is the enumerated position index of the input features. The input and output signals are represented by I and O, respectively, with the same spatial area and channel number. On the input feature map, F(·) calculates the similarity response between position i and all positions j. The response is multiplied by the matching feature representation calculated by γ(·) after normalizing with the coefficient f(O). The refinement block can extract long-distance dependencies between multiple places by calculating the response matrix, which efficiently enlarges the receptive fields of deep convolution layers. It overcomes the shortcoming of standard convolution operations, which can only gather limited data from nearby regions. Figure 4 depicts our proposed DIRB for obtaining spatially relevant information in a feature space.
F(O_i, O_j) = α(X_0)_i ⊗ β(X_0)_j

where X_0 represents the input features and F(O_i, O_j) represents the response weight vector for every position. Convolution operations α(·) and β(·) are used to produce the feature descriptions, which are multiplied to create the matching matrix.
X_IF = softmax(F(O_i, O_j)) ⊗ γ(X_0)

where softmax(·) and X_IF denote the normalization operation and the improved features, respectively. The improved features X_IF are calculated by multiplying the response weight vector with the feature representation produced by the 1×1 convolution operation γ(·).
In the refinement block, we include a residual connection, constructed similarly to a residual learning network, that combines the input features X_0 and the improved features X_IF. It enables the component to concentrate on improving high-frequency information rather than low-frequency information.
Compared with the way typical regular convolution procedures gradually expand their receptive fields, our proposed refinement block can acquire the spatial dependency between any two locations, further refining and improving the flow of gradients and information. Our DIRB also adds global information to the features, which allows our network to better utilize the promising information contained within the low-resolution reconstructed images.
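A minimal sketch of such a non-local refinement block with a residual connection; the 1×1 projections stand in for α(·), β(·), and γ(·), and the channel widths are assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class RefinementBlock(nn.Module):
    """Non-local refinement sketch: pairwise similarity between all
    spatial positions (softmax-normalized), applied to gamma(.)
    features, plus a residual connection so the block focuses on
    high-frequency corrections."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Conv2d(channels, channels, 1)   # alpha(.)
        self.beta = nn.Conv2d(channels, channels, 1)    # beta(.)
        self.gamma = nn.Conv2d(channels, channels, 1)   # gamma(.)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.alpha(x).flatten(2)                    # (b, c, hw)
        k = self.beta(x).flatten(2)
        v = self.gamma(x).flatten(2)
        # matching matrix over all position pairs, softmax-normalized
        sim = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (b, hw, hw)
        out = (v @ sim.transpose(1, 2)).view(b, c, h, w)
        return x + out                                  # residual connection

x = torch.randn(1, 4, 8, 8)
y = RefinementBlock(4)(x)
```

Note the O((hw)²) cost of the full response matrix, which is why such blocks are applied to low-resolution feature maps.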

IV. EXPERIMENTS

A. DATASET
The experimental datasets are primarily separated into two types: training data and test data. We randomly select 300k images from the Open Images dataset [46] and crop them to 256 × 256 pixels for training. For testing, the KODAK image dataset [47] and the CLIC professional validation dataset [48] are employed, both of which include high-resolution natural images. The KODAK dataset comprises 24 photos with a resolution of 512 × 768 pixels and a broad range of contents and patterns that are artifact-sensitive (restricted color gradients). As a result, it has frequently been employed to test image compression techniques. The CLIC dataset [48] includes 41 pictures acquired with mobile phones and professional cameras. The images have greater resolutions, with an average size of 1913 × 1361 pixels for mobile shots and 1803 × 1175 pixels for professional photos.

B. TRAINING DETAILS
All experiments are carried out on a Windows 10 workstation with an Intel Core i7 processor, 32 GB of RAM, and a single NVIDIA GeForce RTX 2070 GPU with 8 GB of memory running CUDA 10.0. The code is implemented in Python 3.7.0 in a Conda environment, with PyTorch 1.0.0 as the deep learning framework. The Adam optimizer [49] is used to train all models for 180k steps with a batch size of 8. For the first 110k iterations, the learning rate is set to 3 × 10⁻⁴, then reduced to 3 × 10⁻⁵ for the next 35k iterations, and finally to 1 × 10⁻⁵ for the final 35k iterations. The channel numbers of the latent and hyper-latent variables are set to 320 and 192, respectively, in the proposed model.
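The stated learning-rate schedule is piecewise constant over the 110k + 35k + 35k iterations and can be expressed directly:

```python
def learning_rate(step):
    """Piecewise-constant schedule from the training setup:
    3e-4 for the first 110k steps, 3e-5 for the next 35k,
    then 1e-5 for the final 35k (a sketch of the stated schedule)."""
    if step < 110_000:
        return 3e-4
    if step < 145_000:
        return 3e-5
    return 1e-5
```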

C. EVALUATION METRICS
This article evaluates the rate-distortion performance in bits per pixel (bpp) while the model is optimized using PSNR [50] and MS-SSIM [33]. To show coding efficiency, rate-distortion (RD) curves are generated. We followed the same settings as [51]; for MS-SSIM, the λ values are fixed to 2.41, 5.24, 8.31, 15.65, 30.43, and 60.56.
The experiments are also carried out using the MS-SSIM quality measure, as seen in Figure 9(b). We report MS-SSIM values in decibels (i.e., −10 log₁₀(1 − MS-SSIM)) to better illustrate the progress. The results clearly show that our method achieves state-of-the-art performance against both the traditional methods, including [4] and [3], and the deep learning-based methods, including [8], [10], [13], [14], and [5]. Therefore, we can say that the AAB, GMB, and DIRB we have presented have a significant influence on achieving higher RD performance and improving the reconstructed image's similarity. Please refer to the ablation study (in the next subsection) for a better idea of the modules' efficacy.
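The decibel conversion used for MS-SSIM is a one-liner; an MS-SSIM of 0.99 corresponds to 20 dB, and each additional dB represents a further order-of-magnitude-scaled reduction of the residual 1 − MS-SSIM:

```python
import math

def msssim_db(msssim):
    """Convert an MS-SSIM score to decibels: -10 * log10(1 - MS-SSIM)."""
    return -10.0 * math.log10(1.0 - msssim)
```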
We employ the CLIC [48] professional validation dataset to confirm the robustness of our technique, and the results are shown in Table 1. It is noteworthy that our approach also yields state-of-the-art results in terms of MS-SSIM, which we express in decibels (i.e., −10 log₁₀(1 − MS-SSIM)). However, regarding PSNR, our method achieves

FIGURE 5. Qualitative performance comparison of our reconstructed images with existing methods, such as Ballé et al. [5], BPG444 [4], JPEG [2], and VTM 8.0 [3]. These images are taken from the KODAK [47] dataset.

C. ABLATION STUDY
We perform some ablation studies on the KODAK dataset [47] to further illustrate the robustness and effectiveness of our proposed approach.
In Table 2, we investigate adopting existing attention modules in place of our suggested attention module in our proposed approach for two λ values. PSNR performance is relatively poor (27.23 dB at λ = 2.41 and 27.51 dB at λ = 8.31, i.e., at low and high bit rates) when no attention module is included in the baseline model. When the attention modules of Cheng et al. [13], Chen et al. [14], and ours are utilized, the PSNR values at low bit rates improve by around 15% (31.89 vs. 27.23 and 32.21 vs. 27.23) for [13] and [14], and by about 17% (32.98 vs. 27.23) for ours, respectively. The PSNR improves significantly when λ = 8.31; for example, for our suggested adjacent attention module, the PSNR improves by roughly 23% (35.67 vs. 27.51) and even by around 5% (35.67 vs. 33.87 and 35.67 vs. 34.01) over the prior modules of [13] and [14]. To further verify the effectiveness of our three proposed modules in the main architecture, we carried out another experiment in terms of PSNR, MS-SSIM, and inference time by replacing and adding the modules, with the bpp kept close to 1, in Table 3.

VI. CONCLUSION
This paper introduces a deep learning-based efficient image compression model that utilizes an autoencoder-style network. To increase the overall performance of image compression, three additional components, namely the Adjacent Attention Block (AAB), the Gaussian Merge Block (GMB), and the Decoded Image Refinement Block (DIRB), are included in this model. The AAB concentrates the texture on the edges of the image in order to allocate additional bits for capturing spatial correlations while suppressing irrelevant features. The GMB and DIRB are applied to model the distribution of the latent representation and repair the defects in low-resolution decoded images, respectively. Two publicly available datasets (KODAK and CLIC) are employed in the experiments. Experimental findings reveal that the proposed model outperforms existing deep learning-based techniques in terms of MS-SSIM and PSNR. In the future, we will investigate additional components that influence the reconstructed images, such as the entropy model.