Comprehensive Comparisons of Uniform Quantization in Deep Image Compression

In deep image compression, uniform quantization is applied to the latent representations obtained by an auto-encoder architecture to reduce the number of bits before entropy coding. Quantization poses a problem for the end-to-end training of deep image compression: its gradient is zero almost everywhere, so it cannot back-propagate meaningful gradients. Many methods have been proposed to approximate quantization so that gradients can be obtained. However, there have been no equitable comparisons among them. In this study, we comprehensively compare the existing approximations of uniform quantization. Furthermore, we evaluate possible combinations of quantizers for the decoder and the entropy model, as the approximated quantizers can differ between them. We conduct experiments using three network architectures on two test datasets. The experimental results reveal that the best approximated quantization differs by network architecture, and that the best approximation for each of the three architectures differs from the one originally used for it. We also show that the combination of quantizers that uses universal quantization for the entropy model and differentiable soft quantization for the decoder is a comparatively good choice across architectures and datasets.


I. INTRODUCTION
IMAGE compression is a fundamental image processing task. It saves storage and Internet traffic costs by reducing the number of bits needed to represent images. Traditional standards such as JPEG [1], JPEG2000 [2], BPG [3], and WebP [4] compress images using hand-crafted modules, each of which is optimized separately. Recent studies on image compression have been based on deep neural networks (deep image compression) [5], [6]. In contrast to traditional methods, deep image compression optimizes its modules in an end-to-end manner. Recently, considerable progress has been made, resulting in higher performance than traditional compression methods [7], [8].
Deep image compression comprises four modules: an encoder, a quantizer, a decoder, and an entropy model. The encoder extracts latent representations from an image. The quantizer quantizes the latent representations for reducing bits. The decoder reconstructs the image from the quantized representations to be close to the input. The entropy model estimates the probabilities of the quantized representations for entropy coding. Deep image compression optimizes these modules in an end-to-end manner as a joint rate-distortion optimization problem.
Quantization is a critical component of image compression. Quantization is mainly classified into two types: uniform and non-uniform. In traditional image compression using orthogonal transformations, the rate-distortion optimized quantization is uniform quantization [9]. In deep image compression, uniform quantization is also a standard operation [5], [7], [10], [11]. However, uniform quantization is defined differently during training: quantizers are always approximated for end-to-end training so that meaningful gradients can be back-propagated, because the gradient of quantization is zero almost everywhere.
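To make the zero-gradient problem concrete, the following minimal Python sketch estimates the derivative of rounding by central finite differences. The helper name `numerical_grad`, the step size, and the sample points are illustrative assumptions and do not come from the compared methods.

```python
def numerical_grad(fn, y, eps=1e-4):
    """Central finite-difference estimate of d fn / d y at the point y."""
    return (fn(y + eps) - fn(y - eps)) / (2 * eps)


# Away from the rounding boundaries (k + 0.5), rounding is piecewise constant,
# so a small perturbation of the input does not change the output.
grads = [numerical_grad(round, y) for y in (0.2, 1.3, -2.4)]

# For comparison, the identity function has a slope of one everywhere.
identity_grad = numerical_grad(lambda y: y, 0.2)

print(grads, identity_grad)
```

The estimates for rounding vanish at every sample point, which is why an approximation is needed during training, whereas the identity function used by several of the approximations passes the gradient through unchanged.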
Several approximations of uniform quantization have been proposed. Additive uniform noise (AUN-Q) [5] is the most standard approximation. AUN-Q allows a continuous relaxation of the probability mass function of the quantized representation and approximates the quantization error. Other studies proposed other quantization approximations such as rounding with a straight-through estimator (STE-Q) [12], universal quantization (U-Q) [13], [14], [...]
The contributions of this study are as follows.
• We compare seven approximations of uniform quantization. We also compare their combinations for a decoder and an entropy model. To the best of our knowledge, there have been no comprehensive comparisons of the approximated quantization for deep image compression.
• We find that there is no unique solution; in other words, the best approximated quantization differs depending on the network architecture. We also confirm that using the best approximation instead of the one used in the original deep compression model improves the performance.
• We find that the combination of U-Q for an entropy model and DS-Q for a decoder is a comparatively good choice across architectures and datasets.
This paper is an extended version of our conference paper [26]. We extend it in multiple aspects: we include additional quantization approximation methods such as SGA-Q [15], STH-Q [16], and DS-Q [22] in the comparison, evaluate approximated quantization at various bitrates with more datasets and more network architectures, and provide more insights into the experimental results.
The remainder of this paper is organized as follows. In Section II, we review related work on deep image compression. In Section III, we describe the overview of our comparison. In Section IV, we evaluate the combinations of approximated quantization. In Section V, we conclude this paper.

II. RELATED WORK
A. DEEP IMAGE COMPRESSION
In the early stages, some studies addressed deep image compression that was optimized only in terms of distortion [27]-[29]. They used recurrent neural networks as the encoder and decoder and changed the bitrates by the number of iterations during testing. Recent studies have formulated deep image compression as a joint rate-distortion problem. They used four modules: an entropy model, an encoder, a decoder, and a quantizer. We present some highlights of relevant previous studies on these modules. The review paper [30] describes them more exhaustively.
The entropy model estimates the probability distribution of the quantization output. In the early stage, the probability distribution was estimated per pixel by employing a factorized-prior model [5] or Gaussian scale mixtures [12]. Ballé et al. [10] presented a hyper-prior model that parameterized the probability distribution as a zero-mean Gaussian distribution to capture the spatial redundancy of the quantization output. The parameters of the probability distribution were estimated [...]
Many studies used architectures proposed in image super-resolution to improve the encoder and decoder. Li et al. [31] removed the batch normalization layer [32] from the residual blocks [33] in their experiments. Attention-based architectures such as channel attention [34] and non-local attention [35] were also used in [36] and [7], [37], [38], respectively. Some studies proposed task-specific architectures for deep image compression. Lin et al. [39] proposed a spatial recurrent neural network to remove redundant information between adjacent blocks. Wang et al. [40] addressed an ensemble of encoders and entropy models in deep image compression with block-wise coding.
These methods use quantization in their pipeline regardless of the architecture of the entropy model, encoder, and decoder. In this study, we focus on a quantizer and detail related works on quantization.

B. QUANTIZER IN DEEP IMAGE COMPRESSION
Quantization is classified into uniform quantization and non-uniform quantization. Similar to traditional compression methods such as JPEG [1], uniform quantization is generally used in deep image compression. AUN-Q has been the most widely used approximation [6], [7], [10], [12] since it was proposed in [5]. Choi et al. [13] introduced universal quantization [41], [42] for approximated quantization. They proposed U-Q as an alternative to AUN-Q and achieved better performance than AUN-Q. Theis et al. [12] proposed STE-Q, which extended binarization with STE [43] to deep image compression. Yang et al. [15] proposed SGA-Q to refine the quantized latent representations by iterative inference. Guo et al. [16] presented STH-Q. They first trained a model with AUN-Q. Then, they trained only the decoder and the entropy model by performing quantization without approximation to reduce the mismatch between training and testing.
In general, the same approximated quantization is used for the decoder and the entropy model; however, the two can be different. In this study, we evaluate combinations of different approximated quantization. The work in [18] is so far the only one that used different approximated quantizers: STE-Q for the decoder and AUN-Q for the entropy model. This combination achieved better compression performance than using only AUN-Q [17], [18], [44]. We consider that the effective approximation method differs between the decoder and the entropy model. Therefore, we comprehensively compare the combinations of several approximation methods for a decoder and an entropy model.
Pan et al. [45] studied the mechanism of the approximated quantization. They identified three gaps in the approximated quantization: discrete gap, entropy estimation gap, and local smoothness gap. They analyzed these gaps and addressed them by proposing soft-STE. In contrast, our study focuses on an empirical comparison of existing methods and their combinations.
Non-uniform quantization uses a non-uniform quantization interval instead of a uniform one. Some studies based on clustering techniques [25], [36], [46] learned the quantization interval. Cai et al. [47] alternately optimized the quantization interval and the encoder-decoder.
Although several studies have used non-uniform quantization, it is not standard in current deep image compression studies (e.g., a state-of-the-art compression method [21] uses uniform quantization). Therefore, we mainly focus our comparison on uniform quantization.
A few works investigated other types of quantization. Li et al. [48] proposed incorporating trellis coded quantization. Löhdefink et al. [49] proposed a one-hot max quantization. Although these methods showed better performance at low rates such as 0.1 bits per pixel (BPP), they perform worse at rates higher than 0.45 BPP.

III. METHOD
We compare approximated quantization and their combinations comprehensively. We first explain the outline of deep image compression in our comparison. Thereafter, we explain existing approximated quantization and the variants that we compare. Then, we explain their combination for a decoder and an entropy model. Finally, we explain the implementation of quantization for compression models using a hyper-prior model [10] as the entropy model.

A. OUTLINE OF OUR COMPARISON
Image compression aims to compress an image into a small number of bits and reconstruct the image from them. Deep image compression achieves this goal using four modules: a decoder, an encoder, an entropy model, and a quantizer. In our comparison, we use two quantizers for an entropy model and a decoder instead of a single quantizer. We show the outline in Fig. 1. Given quantized latent representations extracted by an encoder and two quantizers from an image, a decoder reconstructs the image, and an entropy model estimates the probabilities for entropy coding. These modules are learned in an end-to-end manner using a joint rate-distortion optimization framework.
We explain the outline of deep image compression in our comparison with mathematical formulations. Let $x \in \mathbb{R}^N$ be a vector of the original image, $f : \mathbb{R}^N \to \mathbb{R}^M$ be an encoder, $q_{\{\mathrm{ent},\mathrm{dec}\}} : \mathbb{R}^M \to \mathbb{R}^M$ be two quantizers for an entropy model and a decoder, and $g : \mathbb{R}^M \to \mathbb{R}^N$ be a decoder. $N \in \mathbb{Z}$ is $HWC$, where $H, W, C \in \mathbb{Z}$ are the height, width, and number of channels of the original image, respectively. $M \in \mathbb{Z}$ is $H'W'C'$, where $H', W', C' \in \mathbb{Z}$ are those of the latent representations.
Deep image compression aims to minimize the distortion and the rate jointly. Let $D : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be the distortion, $R : \mathbb{R}^M \to \mathbb{R}$ be the rate, and $\lambda \in \mathbb{R}$ be a hyper-parameter to balance the output of $D$ and $R$. The loss function is written as
$$L = R(q_{\mathrm{ent}}(f(x))) + \lambda D(x, g(q_{\mathrm{dec}}(f(x)))). \quad (1)$$
If $q_{\mathrm{ent}} = q_{\mathrm{dec}}$, we use the same approximated quantization for an entropy model and a decoder in our implementation. This is equal to using only a single quantizer. During testing, both $q_{\mathrm{ent}}$ and $q_{\mathrm{dec}}$ become the round function. Therefore, the quantization during testing does not depend on the approximation method used during training.
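The joint objective with two quantizers can be illustrated with a small pure-Python sketch. Everything concrete here is an assumption made for illustration: the toy encoder and decoder (division and multiplication by 4), the mean squared error as $D$, a discretized Gaussian with a fixed scale as the entropy model behind $R$, and the weighting $R + \lambda D$.

```python
import math


def gaussian_cdf(z, scale):
    """CDF of a zero-mean Gaussian with the given standard deviation."""
    return 0.5 * (1.0 + math.erf(z / (scale * math.sqrt(2.0))))


def rate_bits(y_hat, scale=2.0):
    """Rate R: total -log2 probability of each value under a Gaussian
    integrated over a unit-width bin centered at that value."""
    bits = 0.0
    for v in y_hat:
        p = gaussian_cdf(v + 0.5, scale) - gaussian_cdf(v - 0.5, scale)
        bits += -math.log2(p)
    return bits


def mse(x, x_hat):
    """Distortion D: mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rd_loss(x, f, g, q_ent, q_dec, lam):
    """Rate-distortion loss with separate quantizers for the two branches."""
    y = f(x)
    rate = rate_bits(q_ent(y))         # entropy-model branch
    distortion = mse(x, g(q_dec(y)))   # decoder branch
    return rate + lam * distortion, rate, distortion


# Toy modules (hypothetical): a fixed scaling stands in for learned networks.
f = lambda x: [v / 4.0 for v in x]
g = lambda y: [v * 4.0 for v in y]
hard_round = lambda y: [float(math.floor(v + 0.5)) for v in y]

x = [1.0, 5.0, -3.0, 11.0]
lam = 0.01
loss, rate, distortion = rd_loss(x, f, g, hard_round, hard_round, lam)
print(loss, rate, distortion)
```

Passing the same function as `q_ent` and `q_dec`, as in the usage above, corresponds to the single-quantizer case; passing two different functions corresponds to the combinations evaluated in this study.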

B. DETAILS OF EXISTING QUANTIZATION METHODS
Quantization should be differentiable to back-propagate meaningful gradients for end-to-end learning. We explain five existing approximated quantizations and their two variants that we compare in our experiments. We list them in Table 1. The five existing methods are AUN-Q, STE-Q, U-Q, SGA-Q, and STH-Q. The two variants are DS-Q and SRA-Q.
AUN-Q [5] adds uniform noise $u \sim U[-\frac{1}{2}, \frac{1}{2}]$ to the latent representations $y = f(x)$, where $U$ is a uniform distribution. AUN-Q is written as follows:
$$q(y) = y + u. \quad (2)$$
The gradient is one. AUN-Q approximates the quantization operation well in terms of the probability distribution. However, there is an apparent gap between training and testing; noise is not added during testing. This is called the discrete gap in [45].
STE-Q [12] approximates the gradient of quantization using STE [43]. The forward pass is quantization without approximation and is written as
$$q(y) = \lfloor y \rceil, \quad (3)$$
where $\lfloor \cdot \rceil$ is the round function. For back-propagation, STE assumes that the forward pass is the identity function; therefore, the gradient is one. The advantage of STE-Q is that the operation during training is equivalent to that during testing. However, STE-Q does not learn the probability distribution of quantized representations. This is called the entropy estimation gap in [45].
U-Q in [13] applies universal quantization [41], [42] to deep image compression. Given a common uniform noise $u \sim U[-\frac{1}{2}, \frac{1}{2}]$, U-Q is written as follows:
$$q(y) = \lfloor y + u \rceil - u. \quad (4)$$
The probability density functions of U-Q and AUN-Q are the same. The gradient of U-Q is approximated by STE, and the gradient is one. U-Q was reported to achieve better performance than AUN-Q [13].

TABLE 2: Network architectures used in our comparison.
Network        Encoder and decoder                                Entropy model                             Quantizer
Ballé17 [5]    three convolutional layers                         factorized-prior                          AUN-Q
Ballé18 [10]   four convolutional layers                          hyper-prior                               AUN-Q
Cheng20 [7]    residual blocks and simplified attention blocks    hyper-prior and autoregressive context    AUN-Q

SGA-Q [15] performs rounding stochastically instead of deterministically. SGA-Q becomes more deterministic in the latter part of training. We visualize SGA-Q in Fig. 2.
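Before turning to the details of SGA-Q, the forward passes of AUN-Q, STE-Q, and U-Q described above can be sketched in a few lines of Python. This is only an illustration on plain lists, not the implementation used in the experiments: the function names are hypothetical, gradients are not modeled, and sharing one noise value across all elements in `u_q` is our reading of "common uniform noise".

```python
import math
import random


def round_half_up(v):
    """Round to the nearest integer, ties upward (avoids Python's
    round-half-to-even behavior)."""
    return math.floor(v + 0.5)


def aun_q(y, rng):
    """AUN-Q: add independent uniform noise from [-1/2, 1/2] to each element."""
    return [v + rng.uniform(-0.5, 0.5) for v in y]


def ste_q(y):
    """STE-Q forward pass: hard rounding. The backward pass (not shown)
    would treat this function as the identity, i.e. a gradient of one."""
    return [float(round_half_up(v)) for v in y]


def u_q(y, rng):
    """U-Q: draw one noise value, add it, round, and subtract it again.
    Sharing a single u across all elements is an assumption of this sketch."""
    u = rng.uniform(-0.5, 0.5)
    return [round_half_up(v + u) - u for v in y]


rng = random.Random(0)
y = [-1.7, -0.2, 0.4, 2.6]
y_aun, y_ste, y_uq = aun_q(y, rng), ste_q(y), u_q(y, rng)
print(y_aun)
print(y_ste)
print(y_uq)
```

In this sketch, both noise-based outputs stay within half a quantization step of the input, and the U-Q outputs lie on an integer grid shifted by the shared noise value, whereas the STE-Q output is exactly what the decoder receives at test time.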
Let $\tau = \min(0.5, 0.5 \exp(-c(t - t_0))) \in \mathbb{R}$, where $t \in \mathbb{Z}$ is the iteration and $c \in \mathbb{R}$ and $t_0 \in \mathbb{Z}$ are hyper-parameters to adjust $\tau$. SGA-Q can be written as
$$q(y_i) = \lfloor y_i \rfloor + \delta, \quad (5)$$
where $\delta \in \{0, 1\}$, $P(\delta = 0) = 1 - p_\tau \propto \exp(-\operatorname{arctanh}(y_i - \lfloor y_i \rfloor)/\tau)$, and $P(\delta = 1) = p_\tau \propto \exp(-\operatorname{arctanh}(\lceil y_i \rceil - y_i)/\tau)$. As the number of iterations increases, $\tau$ decreases gradually, and $P(\delta = 0)$ and $P(\delta = 1)$ approach zero or one. SGA-Q behaves stably around the boundary of rounding, in contrast to STE-Q. We approximate the gradient using the Gumbel-softmax trick [50], [51]. Yang et al. [15] used this technique only for iterative inference, but we used it for training. By using the Gumbel-softmax trick, the forward pass is rewritten as follows:
$$q(y_i) = \lfloor y_i \rfloor + \frac{h_\tau(\log p_\tau + g_1)}{h_\tau(\log(1 - p_\tau) + g_0) + h_\tau(\log p_\tau + g_1)}, \quad (6)$$
where $g_0$ and $g_1$ are sampled from the Gumbel(0, 1) distribution and $h_\tau(x) = \exp(x/\tau)$. The gradient is calculated using this forward pass. Because SGA-Q adds noise drawn from a Gumbel distribution, it is considered a noise-based method.
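The temperature schedule and the rounding probabilities of SGA-Q can be sketched as follows. This is a scalar illustration under stated assumptions: the function names are hypothetical, the fractional part is clamped by a small `eps` so that arctanh stays finite, $\lceil y_i \rceil - y_i$ is computed as one minus the fractional part (valid for non-integer inputs), and the Gumbel-softmax relaxation used for the gradient is not included.

```python
import math
import random


def temperature(t, c, t0):
    """tau = min(0.5, 0.5 * exp(-c * (t - t0))): constant until t0, then decaying."""
    return min(0.5, 0.5 * math.exp(-c * (t - t0)))


def round_up_probability(y, tau, eps=1e-6):
    """P(delta = 1), i.e. the probability of rounding y up to floor(y) + 1."""
    frac = min(max(y - math.floor(y), eps), 1.0 - eps)  # keep atanh finite
    logit_down = -math.atanh(frac) / tau        # log-weight of delta = 0
    logit_up = -math.atanh(1.0 - frac) / tau    # log-weight of delta = 1
    m = max(logit_down, logit_up)               # subtract the max for stability
    w_down = math.exp(logit_down - m)
    w_up = math.exp(logit_up - m)
    return w_up / (w_down + w_up)


def sga_q_sample(y, tau, rng):
    """Stochastic rounding: floor(y) + delta with delta ~ Bernoulli(p_tau)."""
    delta = 1 if rng.random() < round_up_probability(y, tau) else 0
    return math.floor(y) + delta


rng = random.Random(0)
for t in (0, 960_000, 970_000, 990_000):
    tau = temperature(t, c=0.0003, t0=960_000)
    print(t, tau, round_up_probability(0.8, tau), sga_q_sample(0.8, tau, rng))
```

With the values of $c$ and $t_0$ taken from the experimental settings, the temperature stays at 0.5 until iteration $t_0$ and then decays, so an input such as 0.8 is rounded up with a probability that moves from moderately above one half toward one.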
In STH-Q [16], a compression model is first trained using AUN-Q. Thereafter, the approximation of the quantization is disabled at the $t_0$-th iteration, and only the decoder and the entropy model are trained, where $t_0$ is a hyper-parameter. Therefore, the gradient is one before the $t_0$-th iteration and zero thereafter. The advantage of this approach is that there is no quantization gap between the training time after the change and the testing time. However, the encoder can be suboptimal because it is not trained jointly with the decoder and the entropy model in the latter part of the training.
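The two-stage schedule of STH-Q can be summarized by the following sketch. The function names and the module labels are hypothetical, and treating iteration $t_0$ itself as the first hard-rounding step is an assumption about the boundary.

```python
import math
import random


def sth_q(y, t, t0, rng):
    """Before iteration t0: AUN-Q (additive uniform noise).
    From iteration t0 on: hard rounding, exactly as at test time."""
    if t < t0:
        return [v + rng.uniform(-0.5, 0.5) for v in y]
    return [float(math.floor(v + 0.5)) for v in y]


def trainable_modules(t, t0):
    """Modules that receive updates at iteration t. After the switch the
    gradient through the quantizer is zero, so the encoder is no longer trained."""
    if t < t0:
        return {"encoder", "decoder", "entropy_model"}
    return {"decoder", "entropy_model"}


rng = random.Random(0)
t0 = 960_000
print(sth_q([0.4, 1.6], 0, t0, rng), sorted(trainable_modules(0, t0)))
print(sth_q([0.4, 1.6], t0, t0, rng), sorted(trainable_modules(t0, t0)))
```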
DS-Q [22] is a variant of STE-Q. It calculates the gradient assuming that the forward pass uses the tanh function. STE-Q calculates the gradient assuming that the forward pass is the identity function, which causes a gradient mismatch. The gradient mismatch leads to a suboptimal solution. DS-Q reduces this gradient mismatch by approximating the round function via a continuous function that is closer to it. We visualize DS-Q in Fig. 3. The equation is written as follows:
$$q(y_i) = \lfloor y_i \rfloor + \frac{1}{2} \frac{\tanh(k r_i)}{\tanh(k/2)} + \frac{1}{2}, \quad (7)$$
where
$$r_i = y_i - \lfloor y_i \rfloor - \frac{1}{2}.$$
The approximated quantization becomes close to the round function as $k$ increases. It becomes close to the identity function if $k$ is small. We treat $k$ as a hyper-parameter following [14]. Rather than always increasing $k$ gradually as in [14], we empirically compare increasing and fixing $k$ and adopt the better option. In [22], $k$ was treated as a learnable parameter, but this was not successful in our experiments.
SRA-Q is a variant of SGA-Q. In the forward pass, SRA-Q performs stochastic rounding as shown in Eq. 5. SRA-Q calculates the gradient following the stochastic rounding in [12], where the gradient is computed through the expectation of the stochastic rounding and thus becomes one.
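The behavior of the tanh-based surrogate for small and large $k$ can be checked numerically. The sketch below assumes the soft-rounding form $\lfloor y \rfloor + \frac{1}{2}\tanh(k r)/\tanh(k/2) + \frac{1}{2}$ with $r = y - \lfloor y \rfloor - \frac{1}{2}$, in the style of [14]; the function names are hypothetical, and the derivative is written out by hand rather than obtained by automatic differentiation.

```python
import math


def soft_round(y, k):
    """Continuous surrogate of rounding; k controls the sharpness."""
    lo = math.floor(y)
    r = y - lo - 0.5
    return lo + 0.5 * math.tanh(k * r) / math.tanh(k / 2.0) + 0.5


def soft_round_grad(y, k):
    """Analytic derivative of soft_round with respect to y (away from integers)."""
    r = y - math.floor(y) - 0.5
    return 0.5 * k * (1.0 - math.tanh(k * r) ** 2) / math.tanh(k / 2.0)


for k in (0.1, 5.0, 50.0):
    values = [soft_round(y, k) for y in (0.3, 0.7)]
    grads = [soft_round_grad(y, k) for y in (0.3, 0.5)]
    print(k, values, grads)
```

For $k = 0.1$, the value used for DS-Q in the experimental settings, the surrogate is almost the identity and its slope is close to one. For large $k$ it approaches hard rounding, with a slope that concentrates around the rounding boundary.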

C. COMBINATION OF APPROXIMATION METHODS FOR TWO QUANTIZERS
We apply these seven approximation methods to two quantizers separately. Specifically, we evaluate the seven approximated quantization methods used as a single quantizer and the ordered pairs of six approximation methods for an entropy model and a decoder. We exclude STH-Q from the pairs because it cannot be applied separately to an entropy model and a decoder. The total number of approximated quantization settings that we evaluate is $7 + {}_6P_2 = 7 + 30 = 37$.
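The count of 37 settings can be verified by enumerating them. In the sketch below, each setting is represented as a pair (quantizer for the entropy model, quantizer for the decoder); this tuple layout is only an illustrative convention.

```python
from itertools import permutations

methods = ["AUN-Q", "STE-Q", "U-Q", "SGA-Q", "STH-Q", "DS-Q", "SRA-Q"]

# Single-quantizer settings: the same method for both branches.
single = [(m, m) for m in methods]

# STH-Q cannot be split between the two branches, so it is left out of the pairs.
separable = [m for m in methods if m != "STH-Q"]

# Ordered pairs of two different methods: (entropy model, decoder).
pairs = list(permutations(separable, 2))

settings = single + pairs
print(len(single), len(pairs), len(settings))
```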

D. QUANTIZATION FOR A HYPER-PRIOR MODEL
A hyper-prior model [10] applies an auto-encoder to latent representations to capture the redundancy of the latent representations. The features extracted by this auto-encoder are called hyper latents. Like the latent representations, they are quantized and compressed by entropy coding. We applied the same quantization method to the hyper latents, following Lee et al. [18]. The overhead of the hyper latents is negligible compared to the latent representations.

IV. EXPERIMENTS
A. EXPERIMENTAL SETTINGS
For the training data, we used a subset of ImageNet [52]. We used high-resolution images whose shorter edge is larger than 256 pixels. We cropped the images at random locations to obtain patches with a resolution of 256 × 256 pixels. We used two datasets for the test data: the Kodak dataset [23] and the CLIC 2020 professional validation dataset (CLIC) [24]. The Kodak dataset comprises 24 images. The resolution of the images is 512 × 768 or 768 × 512 pixels. The CLIC dataset comprises 41 images. The resolution of many of the images is approximately 2,048 × 1,360 pixels. We made evaluations using three network architectures: Ballé17 [5], Ballé18 [10], and Cheng20 [7]. We summarize the network architectures in Table 2. The complexity of the architecture increases in this order. Specifically, Cheng20 uses six and seven residual blocks [33] for the encoder and decoder, respectively. It uses simplified non-local attention modules in these layers. Its entropy model is based on Gaussian mixtures, whose parameters are predicted by a hyper-prior and an autoregressive context model. The quantizer is approximated using AUN-Q. Some convolutional layers in the encoder and decoder of these three networks are followed by generalized divisive normalization [53].
We used the Adam optimizer [54]. The number of iterations [...] We used the BD rate [55] as a metric. This metric indicates the bitrate savings compared to a baseline method at the same distortion. We evaluated the distortion by the peak signal-to-noise ratio (PSNR) and the bitrate by BPP to compute the BD rate. In our experiments, we treated compression models using AUN-Q as the baseline method for each network architecture.
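To illustrate what the BD rate measures, the following sketch computes an average bitrate difference between two rate-distortion curves over their common PSNR range. It is a deliberate simplification and not the procedure of [55]: the standard BD rate fits polynomials to the curves, whereas this sketch uses piecewise-linear interpolation of the log-rate and a midpoint rule. The function names and the example numbers are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            w = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the interpolation range")


def bd_rate_linear(rate_ref, psnr_ref, rate_test, psnr_test, steps=1000):
    """Average bitrate change (in percent) of the test curve relative to the
    reference curve at equal PSNR. Negative values mean bitrate savings."""
    log_ref = [math.log(r) for r in rate_ref]
    log_test = [math.log(r) for r in rate_test]
    lo = max(min(psnr_ref), min(psnr_test))   # common PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    total = 0.0
    for i in range(steps):
        p = lo + (hi - lo) * (i + 0.5) / steps   # midpoint rule
        total += _interp(p, psnr_test, log_test) - _interp(p, psnr_ref, log_ref)
    return (math.exp(total / steps) - 1.0) * 100.0


# Hypothetical curves: the test method needs 10% fewer bits at every PSNR.
psnr = [28.0, 31.0, 34.0, 37.0]
rate_ref = [0.2, 0.4, 0.8, 1.6]
rate_test = [0.9 * r for r in rate_ref]
bd = bd_rate_linear(rate_ref, psnr, rate_test, psnr)
print(bd)
```

In this constructed example, the result is a saving of ten percent, which matches the way the improvements are reported as negative BD rates relative to the AUN-Q baseline.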
We tuned hyper-parameters for SGA-Q, SRA-Q, DS-Q, and STH-Q on Ballé17 [5] on the Kodak dataset. We set $c = 0.0003$, $t_0 = 960{,}000$ for SGA-Q, $c = 0.0003$, $t_0 = 990{,}000$ for SRA-Q, and $t_0 = 960{,}000$ for STH-Q. We fixed $k$ to 0.1 for DS-Q because increasing $k$ gradually does not improve the performance. For the reproducibility of our experiments, we provide our code at https://github.com/kktsubota/uniform-quantizers.
We first performed experiments using Ballé17 for the exhaustive cases, and then using Ballé18 and Cheng20 for limited cases to reduce experimental costs. For Ballé18, we conducted experiments using approximations whose BD rate on the Kodak dataset is less than -4% in Ballé17. For Cheng20, we conducted experiments using approximations whose BD rate on the Kodak dataset is less than -4% in Ballé18. In the experiments, we trained each model twice and report the average score.
FIGURE 4: Rate-distortion curve on the Kodak dataset. Ballé17 + ours, Ballé18 + ours, and Cheng20 + ours denote Ballé17 [5], Ballé18 [10], and Cheng20 [7] using the best approximation, respectively.

B. EVALUATION OF APPROXIMATED QUANTIZATION
We report the results of Ballé17 [5], Ballé18 [10], and Cheng20 [7] in Tables 3, 4, and 5, respectively. We find that using the best approximation for each network architecture outperforms its original approximation method. The improvement in BD rate is -9.38%, -6.94%, and -3.15%, respectively. We observe that the improvement becomes smaller as the network architecture becomes more powerful. The best approximation for each network architecture is different from the approximation methods that have been proposed in previous studies.
The best approximation differs depending on the network architectures. The combination of U-Q for an entropy model and SRA-Q for a decoder is best for Ballé17, whereas the combination is not best for Ballé18 and Cheng20. The best approximation for Ballé18 is the combination of AUN-Q and DS-Q, and the best combination for Cheng20 is the combination of AUN-Q and U-Q. The results indicate that we might need to explore approximation methods per network architecture for the best approximation.
We calculated the average BD rate over the three network architectures and reported it in Table 6. The results show that the combination of U-Q for an entropy model and DS-Q for a decoder is comparatively good among them.
Finally, we discuss the tendency of good approximation methods. In Table 3, we can observe that using noise-based methods such as AUN-Q, U-Q, and SGA-Q for an entropy model and rounding-based methods such as STE-Q, DS-Q, and SRA-Q for a decoder leads to better performance. When using a single quantizer, noise-based methods such as AUN-Q, U-Q, and SGA-Q are better than deterministic rounding-based methods such as STE-Q and DS-Q. This indicates that the entropy estimation gap [45] is more important than the discrete gap [45].

C. PERFORMANCE OF THE BEST APPROXIMATED QUANTIZATION
We demonstrate the performance of the best approximation using the rate-distortion curve. We used PSNR and BPP to evaluate the distortion and the rate, respectively, to plot the rate-distortion curve. We used the three network architectures (Ballé17, Ballé18, and Cheng20) from the previous experiments and adopted the best approximated quantization for each network architecture. We compared them with the same network architectures using their original approximated quantization. We also compared them with traditional methods, i.e., JPEG [1], JPEG2000 [2], and BPG [3], for reference. For these methods, we used the implementations in OpenCV, OpenJPEG [56], and [3], respectively. We ran these programs with the default configurations. The results on the Kodak dataset are shown in Fig. 4. As stated in the previous experiments, using the best approximation is better than using the original approximation.
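The two axes of the rate-distortion curves can be computed as in the following sketch. It assumes 8-bit images with a peak value of 255 and pixel values given as flat lists; the function names and the example numbers are illustrative.

```python
import math


def psnr(x, x_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def bits_per_pixel(num_bits, height, width):
    """Bitrate normalized by the number of pixels of the original image."""
    return num_bits / (height * width)


# Hypothetical example: every pixel is off by one intensity level.
quality = psnr([100.0, 110.0, 120.0], [101.0, 109.0, 121.0])
# A Kodak-sized image (512 x 768 pixels) coded with 196,608 bits.
bpp = bits_per_pixel(196_608, 512, 768)
print(quality, bpp)
```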
We present the qualitative results for the Kodak dataset in Fig. 5 and for the CLIC dataset in Fig. 6, which show the results of three traditional methods, Cheng20, and Cheng20 using the best approximation (Cheng20 + ours). The results demonstrate that Cheng20 + ours achieves better visual quality with a lower bitrate than the traditional methods. Compared to Cheng20, we can observe an improvement in visual quality. For example, in the upper right of the cropped patches in Fig. 5, we can see clear stripes in our method. In the center of the cropped patches in Fig. 6, we can see relatively clear [...] performance by analyzing per patch. We conducted this analysis on Cheng20 with λ = 0.0075 using the Kodak dataset. We split each test image into 16 × 16 patches and classified them into three categories based on the difficulty levels of compression: easy, medium, and hard. We defined the difficulty levels by the test loss of Cheng20 and treated patches in the best 25% as easy patches, patches in the worst 25% as hard patches, and the other patches as medium patches.
To compute the loss function, we measured the bitrate of each patch based on the log-likelihood of the corresponding pixel in the latent representation. We show the performance by difficulty levels in Table 7. The results indicate the best approximation improves the performance on easier patches. We also investigated the reason and found that the number of bits reduces on the flat region as shown in Fig. 7.
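The per-patch analysis described above can be sketched as follows. The helper names are hypothetical, the likelihood values and losses are made-up numbers, and the treatment of ties and of patch counts that are not divisible by four is an assumption.

```python
import math


def patch_bits(likelihoods):
    """Bits of one patch: sum of -log2 of the likelihoods that the entropy
    model assigns to the latent elements belonging to that patch."""
    return sum(-math.log2(p) for p in likelihoods)


def classify_by_loss(losses):
    """Label the 25% of patches with the lowest loss as easy, the 25% with
    the highest loss as hard, and the remaining patches as medium."""
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i])
    n_quarter = n // 4
    labels = ["medium"] * n
    for i in order[:n_quarter]:
        labels[i] = "easy"
    for i in order[n - n_quarter:]:
        labels[i] = "hard"
    return labels


bits = patch_bits([0.5, 0.25])
labels = classify_by_loss([0.9, 0.1, 0.5, 0.4, 0.8, 0.2, 0.3, 0.7])
print(bits, labels)
```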

D. COMPARISON BETWEEN UNIFORM AND NON-UNIFORM QUANTIZATION
In this section, we compare uniform and non-uniform quantization. We compare AUN-Q, a standard uniform quantization, with a basic non-uniform quantization method (NU-Q) proposed in [25]. NU-Q quantizes the input values into learnable centers whereas AUN-Q quantizes the input values into a fixed integer grid.
It is challenging to compare these quantization methods under the same condition because the architectures of the entropy models for these methods are completely different. Therefore, we adopted an advanced entropy model called a 3D-CNN-based model [25] for NU-Q and a primitive entropy model called a factorized-prior model [5] for AUN-Q. With this setting, if AUN-Q with the factorized-prior model outperforms NU-Q with the 3D-CNN-based model, we can state that AUN-Q is superior to NU-Q. Note that the 3D-CNN-based model is more advanced because it uses adjacent pixels and channels as context, unlike the factorized-prior model. We conducted experiments based on the three architectures of the encoder and decoder: Ballé17, Ballé18, and Cheng20. We modified the architectures of the encoder following the non-uniform quantization method [25]: we masked the latent representation with a binary mask obtained by increasing the number of output channels of the encoder. We also modified the loss function for the bitrate following [25]: we computed both the original and masked coding cost. We set λ to {0.001, 0.003, 0.01, 0.03} for AUN-Q, and {0.0003, 0.001, 0.003, 0.01} for NU-Q to align the BPP with AUN-Q. We trained the models with these hyper-parameters once and computed the BD rate relative to AUN-Q. To evaluate the bitrate, we used the estimated BPP instead of the actual BPP for ease of implementation. Other settings are the same as in Sec. IV-A.
We show the experimental results on the Kodak dataset in Table 8. Although an advanced entropy model is used for NU-Q, AUN-Q outperforms NU-Q on all three architectures of the encoder and decoder. This suggests that the performance difference is due to the type of quantization and that uniform quantization is superior to non-uniform quantization. A comparison using the same entropy model would be fairer and is left for future work.

V. CONCLUSION
We performed comprehensive comparisons of approximated quantization for deep image compression. We compared seven approximation methods and their combinations for a decoder and an entropy model. We evaluated these methods using three standard network architectures on two datasets. The experimental results demonstrate that the best approximation method outperforms the existing approximation methods, whereas the best approximation differs depending on the network architecture. We also find that the combination of U-Q for the entropy model and DS-Q for the decoder is a good choice on average.

KOKI TSUBOTA received the B.E. and M.S. degrees from the University of Tokyo in 2018 and 2020, respectively. He is now a Ph.D. student in the Department of Information and Communication Engineering. He is interested in computer vision and multimedia. He works on embedding learning, image compression, image quality assessment, and various topics of manga image processing, such as manga object detection and manga character clustering.
KIYOHARU AIZAWA received the B.E., M.E., and Dr.Eng. degrees in Electrical Engineering, all from the University of Tokyo, in 1983, 1985, and 1988, respectively. He is currently a Professor at the Department of Information and Communication Engineering of the University of Tokyo. He was a Visiting Assistant Professor at the University of Illinois from 1990 to 1992. His research interests are in image processing and multimedia applications. He is a council member of the Science Council of Japan.
VOLUME 4, 2016