Learned Image Compression with Separate Hyperprior Decoders

Learned image compression techniques have developed considerably in recent years. In this paper, we find that the performance bottleneck lies in the use of a single hyperprior decoder, in which case the ternary Gaussian model collapses to a binary one. To solve this, we propose to use three hyperprior decoders to separate the decoding of the mixed parameters in discrete Gaussian mixture likelihoods, achieving more accurate parameter estimation. Experimental results demonstrate that the proposed method, optimized by MS-SSIM, achieves on average a 3.36% BD-rate reduction compared with the state-of-the-art approach. The cost of the proposed method in coding time and FLOPs is negligible.


I. INTRODUCTION
Image compression is an essential technology in the digital age. Traditional codecs [1]-[6], such as JPEG [1], BPG [4] and VVC [6], have achieved significant coding efficiency. However, as their design complexity and coding complexity continue to increase, it becomes increasingly difficult to optimize them further. In addition, modules in traditional codecs are designed with the goal of minimizing mean square error (MSE), which makes it difficult to optimize them for general quality evaluation metrics.
With the resurgence of artificial neural network techniques, learned image codecs [7]-[19] have attracted wide interest in recent years. By jointly optimizing distortion and rate through a Lagrangian multiplier, the work [7] developed a framework for end-to-end training of image compression models and achieved impressive performance. Based on this model, researchers have carried out extensive efforts [8]-[11] to reduce redundancy in the latent variables. Ballé et al. [8] proposed a hyperprior network based on the variational autoencoder that consumes a small number of extra bits to encode the structural information of the latent representation. Lee et al. [9] and Minnen et al. [10] proposed the use of context models to further reduce the spatial correlation in the latent space. Cheng et al. [11] proposed using a Gaussian mixture likelihood to parameterize the distributions of latent variables, providing more flexibility to fit arbitrary distributions. Guo et al. [12] achieved train-test consistency and preserved latent expressiveness via a novel soft-then-hard quantization method. Guo et al. [13] utilized a channel-adaptive codebook to accelerate arithmetic coding of learned image compression while maintaining the rate-distortion performance. In addition to variational autoencoder-based methods, a substantial body of work based on other learned structures has also achieved impactful results. Recurrent neural network-based methods [14], [15] have good scalability in coding and can recursively compress the residual information. Generative adversarial network-based approaches [16], [17] are able to achieve excellent subjective quality at extremely low rates. Flow-based models [18], [19] allow a single model to achieve both lossy and lossless compression of images through a wavelet-like transform and optional quantization, and have the potential to surpass variational autoencoder-based methods.
In this paper, we focus on the variational autoencoder-based image compression framework. Instead of directly minimizing the redundancy in the latent space, we employ a direct and effective structure to obtain higher compression performance. The work we study in this paper is based on that of Cheng et al. [11], which uses a Gaussian mixture model (GMM) prior and achieved state-of-the-art performance. In Cheng et al. [11], as shown in Fig. 1(a), the decoding of the GMM parameters uses only a single hyperprior decoder, which prevents the GMM's ability to fit the data from being fully exploited and becomes a bottleneck that constrains compression performance. With a single hyperprior decoder, the decoding of the three parameters of the Gaussian model, mean, variance, and weight, must share the same hyper decoder. Decoding three parameters of different physical significance simultaneously may be difficult for a single decoder, and the weights of the final trained decoder are a compromise among the three. We provide an intuitive demonstration showing that this leads to degradation from a ternary Gaussian model to a binary one. To avoid this problem, we propose separate hyperprior decoders, as shown in Fig. 1(b), to decouple the parameters of different physical significance in the GMM and accordingly design different decoding networks to train and decode the parameters of the likelihood distribution more efficiently.

II. PROPOSED METHOD

A. Formulation of Learned Image Compression
The image compression process based on the variational autoencoder [7] can be formulated by
$$y = g_a(x; \phi), \quad \hat{y} = Q(y), \quad \hat{x} = g_s(\hat{y}; \theta),$$
where x, x̂, y and ŷ denote the input image, the reconstructed image, the latent variables and the quantized latent variables, respectively. Notation Q denotes real round-based quantization in the inference stage. Notations g_a and g_s denote the encoder and decoder, respectively, and φ and θ correspond to their parameters. In the training process, considering that non-differentiable quantization would prevent the gradient from back-propagating, the work uses uniform noise to replace quantization:
$$\tilde{y} = y + \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right), \quad \tilde{x} = g_s(\tilde{y}; \theta),$$
where ỹ and x̃ represent the latent variables with uniform noise added and their decoded reconstruction. Notation U denotes adding uniform noise in the training stage. The difference between x̃ and x is measured as distortion, and the entropy of ỹ approximates the real code length.
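To make the quantization relaxation concrete, the following PyTorch sketch (our own illustration, not the paper's code) applies real rounding at inference and additive uniform noise during training so that gradients can back-propagate.

```python
import torch

def quantize_latents(y: torch.Tensor, training: bool) -> torch.Tensor:
    """Quantization in the variational framework of [7], [8] (a sketch).

    Training: add i.i.d. uniform noise in [-0.5, 0.5) as a differentiable
    stand-in for rounding. Inference: apply real round-based quantization Q.
    """
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5)
        return y + noise          # noisy latents, the tilde-y of the text
    return torch.round(y)         # quantized latents, the hat-y of the text
```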
To reduce the spatial redundancy in the latent variables y, the work [8] proposed an auxiliary hyperprior network that encodes their structural information z, formulated by
$$z = h_a(y; \phi_h), \quad \hat{z} = Q(z), \quad p_{\hat{y}|\hat{z}}(\hat{y}\,|\,\hat{z}) \leftarrow h_s(\hat{z}; \theta_h),$$
where h_a and h_s denote the encoder and decoder of this hyperprior network, and φ_h and θ_h correspond to their trainable parameters.

B. Separate Hyperprior Decoders
To enhance the modeling capability of the prior p_ŷ|ẑ(ŷ|ẑ), the work of Cheng et al. [11] proposed to use a GMM, which contains three parameters of different physical significance, the weight ω, the mean μ and the variance σ:
$$p_{\hat{y}|\hat{z}}(\hat{y}\,|\,\hat{z}) \sim \sum_{k=1}^{K} \omega^{(k)} \, \mathcal{N}\!\left(\mu^{(k)}, \sigma^{(k)2}\right).$$
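For concreteness, the discrete likelihood induced by such a mixture can be evaluated as a weighted sum of per-component probability masses over the unit quantization bin. The PyTorch sketch below is our own illustration under assumed tensor shapes, not code from the paper or from Cheng et al. [11].

```python
import torch
from torch.distributions import Normal

def gmm_likelihood(y_hat, weights, means, scales):
    """Discrete Gaussian mixture likelihood of quantized latents (a sketch).

    y_hat:  quantized latents, shape (B, C, H, W)
    weights, means, scales: mixture parameters, shape (B, C, H, W, K);
    weights are softmax-normalized over the last (K) dimension.
    """
    w = torch.softmax(weights, dim=-1)
    comp = Normal(means, scales.clamp(min=1e-6))   # K Gaussian components
    y = y_hat.unsqueeze(-1)                        # broadcast against K
    pmf = comp.cdf(y + 0.5) - comp.cdf(y - 0.5)    # mass of the unit-width bin
    return (w * pmf).sum(dim=-1).clamp(min=1e-9)   # mixture likelihood per element
```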
These parameters are obtained from the entropy parameter network f. The hyper parameter K denotes the number of Gaussian components in the GMM, which is set to 3 in both our model and Cheng's. Function c_m denotes the context model and ŷ_{<i} denotes the already decoded subset of ŷ [9]. The second column in Fig. 2 demonstrates the impact of employing this strategy on the data modeling capability of the model. Note that the weight ω has five dimensions in total, which are batch size, height, width, channel, and K, in that order. We first take the minimum value of ω along the last dimension (i.e., K), and then average these minima along the channel dimension. This average expresses the modeling ability of the GMM output by the hyperprior decoder. For example, a value of 0 means the GMM has degraded from a ternary model to a binary one. A similar situation occurs in Cheng's model, where a large proportion of these averages are within 2%. This means that the other two components occupy 98% of the GMM's weight, so the GMM degrades to some extent, limiting the model's ability to fit the data. This is probably caused by the decoding network's compromise among the three parameters. To avoid this entanglement of different parameters in a single network output, we use three separate hyperprior decoders and entropy parameter networks to decode the parameters. In fact, the increase in complexity is limited because the tensor processed by the hyper model has been downsampled several times.
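The collapse diagnostic described above (minimum over the K mixture weights, then averaged over channels) can be written in a few lines. The NumPy sketch below is our own illustration; the layout of the `weights` array (batch, height, width, channel, K) is taken from the description above.

```python
import numpy as np

def min_weight_average(weights: np.ndarray) -> np.ndarray:
    """Average (over channels) of the smallest mixture weight (a sketch).

    weights: GMM weights of shape (batch, height, width, channel, K),
             normalized over the last dimension.
    Returns an array of shape (batch, height, width); values near 0 indicate
    that the ternary GMM has effectively collapsed to a binary one.
    """
    min_over_k = weights.min(axis=-1)     # smallest of the K component weights
    return min_over_k.mean(axis=-1)       # average over the channel dimension
```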
The third column in Fig. 2 shows that, for our model, the value of this average is more evenly distributed between 0 and 10%. In regions with relatively simple image textures, our model also degenerates into a binary Gaussian distribution, implying that the data themselves may not need a complex distribution to be modeled. In contrast, in regions with complex image textures, such as the woman's hair and the lighthouse, our model uses a more complex ternary Gaussian distribution, which is more reasonable for modeling complex data distributions. This comparison visually demonstrates how the proposed method improves on the original GMM approach in Cheng's work.

C. Network Architecture and Training
As shown in Fig. 3, we use a network structure similar to Cheng [11], which employs the attention mechanism and cascaded residual blocks. The difference is that we propose to use separate hyperprior decoders in this framework. The decoded hyper latent code is fed to the three separate hyperprior decoders, and each resulting tensor is concatenated with the output of the context model and fed to its entropy parameter network to yield ω, μ and σ, respectively.
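The structural change can be sketched in PyTorch as follows. This is our own minimal illustration, not the paper's implementation: the layer widths, the two transposed-convolution stages in each decoder, and the 1×1-convolution entropy-parameter heads are placeholder choices, and the context-model output ctx is assumed to have N channels at the latent resolution.

```python
import torch
import torch.nn as nn

class SeparateHyperDecoders(nn.Module):
    """Three parallel hyperprior-decoder / entropy-parameter paths (a sketch)."""

    def __init__(self, N: int = 192, K: int = 3):
        super().__init__()

        def hyper_decoder():
            # upsample the hyper latents back to the latent resolution (placeholder)
            return nn.Sequential(
                nn.ConvTranspose2d(N, N, 5, stride=2, padding=2, output_padding=1),
                nn.LeakyReLU(inplace=True),
                nn.ConvTranspose2d(N, N, 5, stride=2, padding=2, output_padding=1),
                nn.LeakyReLU(inplace=True),
            )

        def entropy_parameters():
            # fuse hyper features with the context-model output (placeholder)
            return nn.Sequential(
                nn.Conv2d(2 * N, N, 1),
                nn.LeakyReLU(inplace=True),
                nn.Conv2d(N, K * N, 1),   # K mixture parameters per latent channel
            )

        # one decoding path per GMM parameter: weight, mean, scale
        self.decoders = nn.ModuleList([hyper_decoder() for _ in range(3)])
        self.heads = nn.ModuleList([entropy_parameters() for _ in range(3)])

    def forward(self, z_hat: torch.Tensor, ctx: torch.Tensor):
        # z_hat: decoded hyper latent code; ctx: context-model output
        outs = [head(torch.cat([dec(z_hat), ctx], dim=1))
                for dec, head in zip(self.decoders, self.heads)]
        omega, mu, sigma = outs
        return omega, mu, sigma   # omega is later softmax-normalized over K
```

The point is simply that each of ω, μ and σ gets its own decoding path instead of sharing one set of hyper decoder weights.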
In training, the Lagrangian multiplier-based rate-distortion loss of our model is
$$\mathcal{L} = \mathbb{E}_{q(\tilde{y},\tilde{z}|x)}\left[-\log_2 p_{\tilde{y}|\tilde{z}}(\tilde{y}\,|\,\tilde{z}) - \log_2 p_{\tilde{z}|\psi}(\tilde{z}\,|\,\psi)\right] + \lambda \cdot D(x, \tilde{x}),$$
where q(ỹ, z̃|x) denotes the variational posterior in the autoencoder. Model p_z̃|ψ(z̃|ψ) denotes the non-parametric, fully factorized density model [8] used to encode z, which can be formulated by
$$p_{\tilde{z}|\psi}(\tilde{z}\,|\,\psi) = \prod_{j}\left(p_{z_j|\psi^{(j)}}\!\left(\psi^{(j)}\right) * \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right)\!\left(\tilde{z}_j\right).$$

III. EXPERIMENT

A. Experimental Setting

Training. We trained our model on the CLIC training dataset [20], containing approximately 1600 images, using MSE with λ in the set {0.0016, 0.0032, 0.0075, 0.015, 0.03, 0.045} and MS-SSIM with λ in the set {3, 12, 40, 120} as quality metrics for optimization. We name the models optimized with the MSE metric Model MSE, and the models trained with the MS-SSIM metric Model MS-SSIM. Hyper parameter N is set to 128 for the lower-rate models and to 192 for the higher-rate models, following the setting in the work of Cheng et al. [11]. We use a randomly selected and cropped subset of the training set as the validation set, containing 48 patches of size 256 × 256. The batch size was set to 8 and 1.08M iterations were conducted for each model to reach stable results. The models were optimized using Adam [21]. The learning rate was kept at a fixed value of 1 × 10^-4 during training and was reduced to 1 × 10^-5 for the last 80K iterations. We chose the variance scaling initializer for the filter kernels and the zeros initializer for the bias vectors. The CPU and GPU used in all experiments are an Intel Xeon Gold 6230 CPU @ 2.10 GHz and an Nvidia RTX 2080 Ti GPU, respectively.

Evaluation. We used the Kodak dataset [22], the CLIC Professional Validation dataset [20], and the HEVC test sequences [23] to evaluate the robustness of our method. Note that the HEVC dataset contains video sequences in YUV format. We used the multimedia processing tool FFmpeg to convert the first frame of each sequence into a PNG image, and combined these frames into a new dataset. Bits per pixel (BPP) is used to measure the rate, while PSNR and MS-SSIM are used to measure image quality. For the implementation of MS-SSIM, we use the calculation provided by TensorFlow [24]. The BD-rate [25] is used to quantitatively compare the compression performance of different codecs. Compared with rate-distortion curves, the advantage of the BD-rate is that it quantifies rate-distortion performance regardless of whether the bit-rate difference between models is obvious or subtle. We use an Excel template proposed in [26] for BD-rate calculation based on piece-wise cubic interpolation. For a fair comparison with the traditional codecs, the encoding and decoding times of the learned image codecs are tested under CPU-only conditions.
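As a code-level counterpart to the Excel template mentioned above, the BD-rate can be approximated with piece-wise cubic interpolation as in the SciPy sketch below. This is our own approximation of the calculation in [25], [26] using SciPy's PCHIP interpolator, not the template itself.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bd_rate(rate_anchor, qual_anchor, rate_test, qual_test):
    """Average rate difference (%) of the test codec vs. the anchor (a sketch).

    rate_*: bits per pixel; qual_*: PSNR (or MS-SSIM in dB) at matching points.
    """
    ra, qa = np.asarray(rate_anchor), np.asarray(qual_anchor)
    rt, qt = np.asarray(rate_test), np.asarray(qual_test)
    # interpolate log-rate as a function of quality with piece-wise cubics
    fa = PchipInterpolator(np.sort(qa), np.log(ra)[np.argsort(qa)])
    ft = PchipInterpolator(np.sort(qt), np.log(rt)[np.argsort(qt)])
    lo, hi = max(qa.min(), qt.min()), min(qa.max(), qt.max())   # overlapping range
    q = np.linspace(lo, hi, 1000)
    avg_log_diff = np.trapz(ft(q) - fa(q), q) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```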

B. Performance Evaluation
Rate-distortion Performance. As shown in Table II, we compare the proposed method with traditional codecs, including JPEG [1], WEBP [2], AVIF [3], BPG [4], HEVC (HM-16.16, x265-3.0) [5] and VVC (VTM-11.2, VVenC-1.1.0) [6], as well as the learned codec [11]. Because the inputs and outputs of the HEVC and VVC codecs are usually in YUV format, we use the PIL library [27] to convert the images between the RGB and YUV formats. The format is YUV420 for VVenC, because only this format is currently supported, while for HM, x265 and VVC the format is YUV444. All images are encoded with the all intra (AI) coding structure for HEVC and VVC. We set a series of QP values for every traditional codec, and each QP value remains constant during the compression process. Finally, we select the results whose quality is closest to that measured by our models, which makes the calculation of BD-rates robust. Compared to these codecs, our models optimized by PSNR and MS-SSIM both achieve the best performance on the three test datasets. To compare the differences with the learned codec more intuitively, we use Cheng [11] as the anchor and calculate the BD-rate reduction and the relative coding times, as shown in Table I. To make the performance evaluation more convincing, Table I also includes the test results under the VMAF metric, which are the averages of Model MSE and Model MS-SSIM for each dataset. It can be seen that our method achieves BD-rate reductions of 2.12%, 3.36% and 2.24% under the PSNR, MS-SSIM and VMAF metrics, respectively. For the HEVC ClassE dataset, our model optimized by MSE achieves a BD-rate reduction of 3.89%. For the HEVC ClassC and HEVC ClassD datasets, our model optimized by MS-SSIM achieves a BD-rate reduction of about 4%. Our models achieve the highest BD-rate reduction, 5.17%, under the VMAF metric for the HEVC ClassE dataset. Overall, our method achieves better RD performance than these traditional codecs and Cheng's method.
Ablation Study. We strictly used the same dataset and training steps to train the proposed model and Cheng's model for a fair comparison; their loss curves are shown in Fig. 4. The proposed method obtains a smaller validation loss than Cheng's method for the models with N=128 and N=192. For the model with N=128, the proposed method improves the performance slightly more than for N=192. The rate-distortion results of the final converged models were demonstrated in the previous subsection.
We also conducted an ablation study on model complexity to demonstrate the effectiveness of our approach. We retained the single hyper decoder structure of Cheng's model and purely increased the number of channels in each convolution layer of the hyperprior decoder and entropy parameter network, while the parameters in the other modules were kept consistent. We name this higher-complexity network structure Cheng*. The details of the layers that differ are shown in Table III. The total model size of Cheng* is slightly larger than that of our proposed model. The loss curves of the three models are shown in Fig. 5. It can be observed that Cheng* cannot achieve a smaller loss than Cheng's model, unlike the method we propose. In this case, therefore, the performance bottleneck does not lie in the number of parameters but in the use of a single hyper decoder, which further proves the effectiveness of our approach.

Complexity. The average absolute coding times over all datasets for the different codecs are given in Table II, and the relative coding complexities compared to Cheng are given in Table I. Compared with the best-performing traditional method, VTM, the absolute encoding time of our model is only about 1/4 of that of VTM. VTM is the reference software of VVC and does not accurately reflect the complexity of a real implementation; therefore, we also tested the performance and complexity of VVenC, a practical implementation of VVC. Although the encoding and decoding times of VVenC are shorter, the average BD-rate reduction of VVenC with AVIF as the anchor is only 18.53%, while that of our proposed approach is 45.78%. The performance difference between the two codecs is very significant. Compared with the learned codec [11], the average encoding and decoding complexity of our model increases by 4.36% and 3.04%, respectively. To quantitatively compare the complexity, we computed the FLOPs of these models, as shown in Table IV. Compared to Cheng, the FLOPs of our model increase by only about 2%. Note that the tensor processed by the hyper coding loop has undergone 4 downsampling operations in the main coding loop, so its height and width are only a small fraction of those of the input image; as shown in Table III, the impact on coding time and FLOPs is very limited.
Subjective Quality Evaluation. We picked the RaceHorses sequence in HEVC ClassC for a subjective quality comparison, as shown in Fig. 6. At about 0.12 bpp, the horse's mane and the belts show a certain degree of distortion in VTM, and their quality is visibly worse in BPG and JPEG. In contrast, our model optimized by MSE retains the horse's mane. For our model optimized by MS-SSIM, the textures of both the horse's mane and the belts are well preserved, achieving good subjective quality.

IV. CONCLUSION
For variational autoencoder-based image compression, a direct and effective strategy to improve model performance is proposed in this paper. By using separate hyperprior decoders for the parameters of different physical significance in the GMM, the value of the minimum weight in the GMM is increased, which improves the model's ability to compress complex images by generating more complex distributions to model them. Compared to the previous work of Cheng, our method achieves BD-rate gains of 2.12%, 3.36% and 2.24% in terms of the PSNR, MS-SSIM and VMAF metrics, respectively, while its cost in coding time and FLOPs is negligible.