ANFIC: Image Compression Using Augmented Normalizing Flows

This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model, which stacks multiple variational autoencoders (VAEs) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further advances compression efficiency by stacking multiple VAEs and extending them hierarchically. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to nearly-lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable-rate compression with a single model.


I. INTRODUCTION
Image compression has been a thriving research area for decades due to the storage and transmission requirements of the various applications that underpin our modern digital life. Image compression also appears in the form of intra-frame coding for video compression [1]. The rapid advances in inter-frame prediction make efficient intra-frame coding increasingly important, because intra-coded frames often dominate the bit rate of a compressed video. Therefore, it is highly desirable to achieve even higher image compression efficiency.
The state-of-the-art image compression methods, e.g. BPG and VVC intra coding, usually involve block-based intra prediction, block-based transform coding of residuals, and context-adaptive binary arithmetic coding. Over the years, tremendous research effort has been invested in improving every component in a way that seeks higher compression efficiency at the expense of an acceptable complexity increase. These hand-crafted codecs, although achieving a good balance between compression efficiency and complexity, lack the opportunity to optimize all the components jointly in a seamless, end-to-end manner.
The rise of deep learning has recently spurred a new wave of developments in image compression, with end-to-end learned systems attracting much attention. Among them, the variational autoencoder (VAE)-based methods [2], [3], [4], [5] have achieved compression performance very close to the latest VVC intra coding. Different from traditional hand-crafted codecs, the VAE-based methods usually implement an image-level non-linear transform that converts an input image into a compact set of latent features, the dimensions of which are much smaller than those of the input image. Ever since the advent of the first VAE-based scheme [6], several improvements have been made to the expressiveness [4], [5], [7] of the autoencoder and the efficiency of entropy coding [2], [3], [4], [5], [8], [9], [10]. By now, the VAE-based methods have become the mainstream approach to end-to-end learned image compression.

Manuscript received July 1, 2021; revised September 29, 2021; accepted October 15, 2021. This work was supported by National Center for High-Performance Computing, Taiwan.
Hsueh-Ming Hang is with the Department of Electronics Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan (e-mail: hmhang@nctu.edu.tw).
Marek Domański is with the Institute of Multimedia Telecommunications, Poznań University of Technology, Poznań, Poland (e-mail: marek.domanski@put.poznan.pl).
However, one issue with most VAE-based schemes is that the autoencoder is generally lossy. There is no guarantee that its non-linear transform can reconstruct the input image losslessly even without quantizing the latent features of the image. This is unlike the traditional transforms, such as Discrete Cosine Transform and Wavelet Transform, which have the desirable property of perfect reconstruction and allow the codec to offer a wide range of quality levels by merely changing the quantization step size.
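The perfect-reconstruction property of such linear transforms is easy to verify numerically. Below is a minimal pure-Python sketch (illustrative, not part of the paper) of an orthonormal 1-D DCT-II and its DCT-III inverse; the roundtrip reconstructs the input up to floating-point error:

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D signal."""
    N = len(x)
    out = []
    for k in range(N):
        ck = math.sqrt(0.5) if k == 0 else 1.0
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
        out.append(ck * math.sqrt(2.0 / N) * s)
    return out

def idct(X):
    """Orthonormal DCT-III, the exact inverse of dct (perfect reconstruction)."""
    N = len(X)
    out = []
    for n in range(N):
        s = sum((math.sqrt(0.5) if k == 0 else 1.0) * X[k] *
                math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
        out.append(math.sqrt(2.0 / N) * s)
    return out
```

Because the transform is orthonormal, no information is lost before quantization; a lossy autoencoder offers no such guarantee.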
Recently, the flow-based models [11], [12] emerged as attractive alternatives. These models have the striking feature of realizing a bijective, invertible mapping between the input image and its latent features via reversible networks composed of affine coupling layers [13], [14]. This invertibility is utilized to develop lossless image compression in [15], while the affine coupling layers are used in place of the lossy autoencoder in [11], [12] to achieve both lossy and lossless (or perceptually lossless) compression with a single unified model. The reversible networks, however, are quite distinct from the commonly used autoencoders, making these two types of compression systems incompatible with each other.
In this paper, we propose a novel end-to-end lossy image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF) [16]. ANF is a new type of flow model that works on an augmented input space to offer greater transformation ability than ordinary flow models. Our scheme ANFIC is motivated by the fact that ANF is a generalization of VAE that stacks multiple VAEs as a flow model. In a sense, this allows ANFIC to extend any existing VAE-based compression system in a flow-based framework to enjoy the benefits of both approaches. ANFIC is novel and unique in that (1) it is distinguished from flow-based compression by operating on an augmented input space, being able to leverage the representation power of any VAE-based image compression, and (2) it is more general than VAE-based compression in allowing the VAE to be stacked and/or extended hierarchically.
Extensive experimental results on the Kodak, Tecnick, and CLIC validation datasets show that ANFIC performs comparably to or better than the state-of-the-art end-to-end image compression in terms of PSNR-RGB. It performs close to VVC intra over a wide range of quality levels, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves state-of-the-art performance among the competing methods when extended with conditional convolutional layers [17] for variable-rate compression with a single model.
Our main contributions are three-fold:
• We propose ANFIC, which uses augmented normalizing flows for image compression, as the first work that leverages VAE-based image compression in a flow-based framework.
• We offer extensive ablation studies to understand and visualize the inner workings of ANFIC.
• Extensive experimental results show that ANFIC is competitive with the state-of-the-art image compression systems, VAE-based and flow-based alike, over a wide range of quality levels and performs close to VVC intra coding.
This work improves on our previous publication [18] by (1) replacing the affine coupling layers with additive coupling layers to improve training stability and avoid performance degradation, (2) introducing a Gaussian mixture model along with the autoregressive module for better entropy coding, and (3) providing more comprehensive ablation studies of ANFIC.
The remainder of this paper is organized as follows: Section II reviews VAE-based image compression and the basics of ANF. Section III elaborates the design of ANFIC. Section IV compares ANFIC with the state-of-the-art methods in terms of objective compression performance and subjective image quality. Section V presents our ablation studies. Finally, we provide concluding remarks in Section VI.

II. RELATED WORK
In this paper, we propose an ANF-based image compression. It can be viewed as an extension of VAE-based image compression. Hence, this section focuses on the recent developments of VAE-based image compression and introduces the fundamentals of ANF to ease the understanding of our scheme.
A. VAE-based Image Compression
The analysis transform g_a encodes the raw image x through an encoding distribution q_φ(y|x) into the latent representation y, which is uniformly quantized as ŷ. The ŷ is then entropy encoded into a bitstream using a learned prior p_π(ŷ) implemented by a network π. Finally, the synthesis transform g_s reconstructs an approximation of the input x from ŷ through a decoding distribution p_θ(x|ŷ).
All the network parameters are trained end-to-end by minimizing

L = E_{x∼p(x)}[−log p_θ(x|ŷ)] + E_{x∼p(x)}[−log p_π(ŷ)] = D + R,   (1)

where the first term, denoted by D, minimizes the negative log-likelihood of x and the second term minimizes the rate R needed for signaling ŷ. In particular, it is shown that minimizing Eq. (1) amounts to maximizing the evidence lower bound (ELBO) of a latent variable model [19], which is specified by p_π(ŷ) and p_θ(x|ŷ), with q_φ(ŷ|x) taking a uniform distribution that models the effect of uniform quantization. In a more general setting, a hyper-parameter λ is introduced to balance between D and R, yielding L = λD + R.
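The weighted rate-distortion objective L = λD + R can be illustrated with a toy sketch (the factorized pmf lookup table below is a hypothetical stand-in for the learned prior network π):

```python
import math

def rd_loss(x, x_hat, y_hat, pmf, lam):
    """Toy rate-distortion loss L = lam * D + R.
    pmf maps each quantized symbol in y_hat to its probability
    under the learned prior (a lookup table here for illustration)."""
    D = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)  # MSE distortion
    R = sum(-math.log2(pmf[s]) for s in y_hat)                # rate in bits
    return lam * D + R
```

Increasing λ trades more bits for lower distortion, tracing out a rate-distortion curve as in the paper's evaluation.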
Ballé et al. [6] are the first to introduce the aforementioned VAE framework, together with a learned factorized prior, to image compression. In entropy coding the image latents, they assume the prior distribution p_π(ŷ) over ŷ to be factorial and learn the distribution with the network π. Their analysis and synthesis transforms are composed of convolutional neural networks and generalized divisive normalization (GDN) layers, which originate from [20].
Enhanced Prior Estimation: The prior distribution p_π(ŷ) crucially determines the number of bits (i.e. the rate) needed to signal the quantized image latents ŷ. Recognizing the sub-optimality of the factorized prior p_π(ŷ), where the feature samples in every channel of ŷ are independently and identically distributed, Ballé et al. [8] propose the notion of a hyperprior to model every feature sample separately by a Gaussian distribution. To this end, additional side information z is extracted from the image latent y and sent to the decoder, making the density estimation of ŷ dependent on the input x. The ŷ and ẑ together form the latent representation of the input x. The hyperprior thus bears the interpretation of factorizing the joint distribution p(ŷ, ẑ) as p(ŷ|ẑ)p(ẑ), where p(ŷ|ẑ) and p(ẑ) are assumed to be Gaussian and factorial, respectively. Hu et al. [3], [10] extend the idea to include more than one layer of hyperprior, leading to a factorization of p(ŷ, ẑ_1, ẑ_2, …, ẑ_n) = p(ŷ|ẑ_1) p(ẑ_1|ẑ_2) ⋯ p(ẑ_n), where ẑ_1, ẑ_2, …, ẑ_n form a multi-layer hyperprior. In addition to the use of the hyperprior, Minnen et al. [2], Chen et al. [4], and Cheng et al. [5] incorporate an autoregressive prior by 2D [2], [5], [9] or 3D [4] masked convolution [21], in order to utilize causal contextual information for better density estimation. In particular, Cheng et al. [5] model p(ŷ|ẑ) with a Gaussian mixture distribution instead of a single Gaussian.
Enhanced Autoencoding Transform: The capacity of the autoencoding transform determines its expressiveness. Chen et al. [4] add residual blocks to the autoencoder along with several non-local attention modules (NLAM). NLAM is shown to facilitate spatial bit allocation among coding areas of varied texture complexity.
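The mean-scale hyperprior idea can be sketched as follows: each quantized sample ŷ is coded under a Gaussian whose mean and scale are predicted from ẑ, with the likelihood integrated over the quantization bin. The stand-alone functions below are simplified stand-ins for the learned networks:

```python
import math

def gaussian_cdf(v, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def likelihood(y_hat, mu, sigma):
    """P(ŷ) for an integer symbol: the Gaussian mass of its
    quantization bin [ŷ - 0.5, ŷ + 0.5]."""
    return gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)

def rate_bits(y_hats, mus, sigmas):
    """Rate for signaling ŷ given the hyperprior-predicted (mu, sigma) pairs."""
    return sum(-math.log2(likelihood(y, m, s))
               for y, m, s in zip(y_hats, mus, sigmas))
```

The sharper the hyperprior's prediction (smaller σ around the true value), the fewer bits the arithmetic coder spends on ŷ, which is exactly why conditioning on side information pays off.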
Unlike most of the VAE-based systems, which operate at the image level, the block-based autoencoder in [7] divides the input image into non-overlapping macroblocks, each of which contains multiple sub-blocks coded sequentially using recurrent analysis and synthesis transforms. It has the striking feature of allowing a high degree of computational parallelism at the macroblock level. In general, most autoencoders are not guaranteed to reconstruct the input perfectly even when no quantization is involved.

B. Flow-based Image Compression
Recently, flow-based models [13], [14] have emerged as an attractive alternative to VAE [19] and other autoencoders. They are characterized by a bijective mapping between the input and its latent representation, ensuring that the input can be perfectly reconstructed from its latent in the absence of quantization. Ma et al. [12] make an interesting attempt to introduce lifting-based coupling layers, a specialized implementation of the additive coupling layers [13], [14] often used to construct flow models, as the analysis and synthesis backbone. In particular, they split an input image, first row-wise and then column-wise, into latent subbands, with the resulting decomposition resembling the 2D wavelet transform. Helminger et al. [11] also use additive coupling layers, but with factor-out splitting to generate a multi-scale image representation, as shown in Fig. 1a. Their work extends the notion of integer discrete flows for lossless compression [15] to lossy compression. Collectively, these works show the potential of flow-based models to offer a wide range of quality levels, from low-rate compression to nearly-lossless or even lossless compression.
Our work aims to leverage the developments of VAE-based schemes in a flow-based framework to enjoy the benefits of both (see Fig. 1b). For this purpose, we resort to augmented normalizing flows [16], the basics of which are presented next.

C. Augmented Normalizing Flows (ANF)
The ANF model [16] is an invertible latent variable model. It is composed of multiple autoencoding transforms, each of which comprises a pair of encoding and decoding transforms as depicted in Fig. 2a. Consider the example of ANF with one autoencoding transform (i.e. one-step ANF). It converts the input x, coupled with an independent noise e, into the latent representation (y, z) with one pair of encoding and decoding transforms:

g^enc_π(x, e) = (x, e ⊙ s^enc_π(x) + m^enc_π(x)) = (x, z),   (2)
g^dec_π(x, z) = ((x − μ^dec_π(z)) ⊘ σ^dec_π(z), z) = (y, z),   (3)

where s^enc_π, m^enc_π, μ^dec_π, and σ^dec_π are element-wise affine transformation parameters. These learnable parameters are produced by the encoding and decoding neural networks, whose weights are referred to collectively as π. Compared with ordinary flow models, ANF augments the input with an independent noise. It is shown in [16] that the augmented input space allows a smoother transformation to the required latent space.
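A one-step ANF and its exact inverse can be sketched as follows. The scalar functions below are toy, element-wise stand-ins for the parameter networks of Eqs. (2) and (3) (in the real model they are networks over the whole complementary part); the point is that the coupling structure is invertible by construction:

```python
import math

# Toy stand-ins for the parameter networks (scales must be non-zero).
s_enc = lambda v: math.exp(0.1 * v)
m_enc = lambda v: 2.0 * v
mu_dec = lambda v: 0.5 * v
sigma_dec = lambda v: math.exp(-0.2 * v)

def one_step_anf(x, e):
    """One-step ANF: (x, e) -> (y, z) via g_enc then g_dec."""
    z = [ei * s_enc(xi) + m_enc(xi) for xi, ei in zip(x, e)]          # g_enc
    y = [(xi - mu_dec(zi)) / sigma_dec(zi) for xi, zi in zip(x, z)]   # g_dec
    return y, z

def one_step_anf_inv(y, z):
    """Exact inverse: recover (x, e) from (y, z)."""
    x = [yi * sigma_dec(zi) + mu_dec(zi) for yi, zi in zip(y, z)]     # g_dec^-1
    e = [(zi - m_enc(xi)) / s_enc(xi) for xi, zi in zip(x, z)]        # g_enc^-1
    return x, e
```

Each sub-transform only rewrites one half of the pair as an affine function of the other half, so inversion is closed-form with no iterative solve.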
Multi-step ANF and Hierarchical ANF: From Fig. 2a and according to Eqs. (2) and (3), the encoding transform g^enc_π and the decoding transform g^dec_π each implement an invertible affine coupling layer. Stacking pairs of these coupling layers leads to an invertible network, termed multi-step ANF, with much greater capacity than one-step ANF. Another way to increase the model capacity is to augment the input with more noise variables, yielding hierarchical ANF (see Fig. 2b). Notably, these two approaches can be combined flexibly for even higher model capacity.
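The stacking can be sketched with the additive-coupling variant that ANFIC later adopts (toy element-wise functions again replace the networks; this is an illustration, not the paper's implementation):

```python
def anf_forward(x, e, m_encs, mu_decs):
    """Multi-step ANF with additive couplings: per step,
    e <- e + m(x) (g_enc), then x <- x - mu(e) (g_dec)."""
    for m, mu in zip(m_encs, mu_decs):
        e = [ei + m(xi) for xi, ei in zip(x, e)]
        x = [xi - mu(zi) for xi, zi in zip(x, e)]
    return x, e

def anf_inverse(y, z, m_encs, mu_decs):
    """Undo the steps in reverse order: x <- x + mu(z), then e <- z - m(x)."""
    for m, mu in zip(reversed(m_encs), reversed(mu_decs)):
        y = [yi + mu(zi) for yi, zi in zip(y, z)]
        z = [zi - m(xi) for xi, zi in zip(y, z)]
    return y, z
```

Adding steps increases capacity while each step stays trivially invertible; additive couplings are also volume-preserving (log-determinant zero).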
Training ANF: Like the ordinary flow models, ANF can be trained by maximizing the augmented joint likelihood, i.e. arg max_π p_π(x, e):

log p_π(x, e) = log p(G_π(x, e)) + log |det (∂G_π(x, e)/∂(x, e))|,   (4)

where G_π = g^dec_{π_N} ∘ g^enc_{π_N} ∘ ⋯ ∘ g^dec_{π_1} ∘ g^enc_{π_1} is the alternate composition of the encoding and decoding transforms with π = {π_1, ⋯, π_N}, and p(G_π(x, e)) represents the specified or learned prior distribution over the latents (y, z). It is shown in [16] that maximizing the augmented joint likelihood p_π(x, e) in ANF amounts to maximizing a lower bound on the marginal likelihood p_π(x), with the gap attributed to the model's inability to model e independently of x.
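Under a factorized standard-normal prior on the latents, the prior term of the augmented log-likelihood can be sketched as below (a hypothetical illustration; for volume-preserving additive couplings the Jacobian log-determinant is zero and is therefore omitted):

```python
import math

def neg_aug_log_likelihood(latents):
    """-log p(G_pi(x, e)) under a factorized N(0, 1) prior.
    With additive (volume-preserving) couplings log|det J| = 0,
    so this equals the full negative augmented log-likelihood."""
    return sum(0.5 * v * v + 0.5 * math.log(2.0 * math.pi) for v in latents)
```

Minimizing this drives the transformed latents toward the prior's high-density region, which is what makes them cheap to entropy-code.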

III. PROPOSED METHOD
Inspired by the fact that most learned image compression is VAE-based and that VAE is equivalent to one-step ANF, we propose an ANF-based image compression framework, termed ANFIC. We first outline the ANFIC framework in Section III-A, with a focus on how to extend VAE-based image compression with a hyperprior by multi-step and hierarchical ANF. This is followed by discussions on the entropy coding of the latent representation and the modeling of the prior distribution (Section III-B), and the training objective (Section III-C).
To the best of our knowledge, ANFIC is the first work that combines VAE and flow models in a unified framework. It is distinguished from flow-based compression in that it operates on an augmented input space (see Fig. 1b), being able to leverage the representation power of any existing VAE-based image compression. Moreover, ANFIC is more general than the VAE-based scheme in allowing the VAE to be stacked and/or extended hierarchically (see Fig. 2).

A. ANFIC Framework
Fig. 3a describes the framework of ANFIC. From bottom to top, it stacks two autoencoding transforms (i.e. two-step ANF), with the top one extended further to the right to form a hierarchical ANF [16] that implements the hyperprior. More autoencoding transforms can be added straightforwardly to create a multi-step ANF. In particular, g^enc_π and g^dec_π in the autoencoding transforms follow Eqs. (2) and (3), except that we make them purely additive by removing s^enc_π(x) and σ^dec_π(z) for better convergence, as with some other flow-based schemes [11], [12].
The autoencoding transform of the hyperprior, which models each sample in the latent representation z_2 as a Gaussian, is defined as

h^enc_{π_3}(z_2, e_h) = (z_2, e_h + m^enc_{π_3}(z_2)) = (z_2, ĥ_2),   (5)
h^dec_{π_3}(z_2, ĥ_2) = (⌊z_2 − μ^dec_{π_3}(ĥ_2)⌉ + μ^dec_{π_3}(ĥ_2), ĥ_2) = (ẑ_2, ĥ_2),   (6)

where ⌊·⌉ (depicted as Q in Fig. 3a) denotes nearest-integer rounding for quantizing the residual between z_2 and the predicted mean μ^dec_{π_3}(ĥ_2) of the Gaussian distribution derived from the hyperprior ĥ_2. This part implements the autoregressive hyperprior in [2], with z_2 denoting the image latents whose distributions are signaled as the side information ĥ_2.
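The residual quantization performed by Q can be sketched as follows (a toy stand-in: the predicted means μ^dec_{π_3}(ĥ_2) are simply given as a list here rather than produced by the hyperprior network):

```python
def quantize_residual(z2, mu):
    """Round the residual to the predicted mean and add the mean back:
    ẑ2 = round(z2 - mu) + mu. A good mean prediction keeps the
    transmitted residuals small and cheap to entropy-code."""
    return [float(round(z - m)) + m for z, m in zip(z2, mu)]
```

Note that only the integer residuals need to be entropy coded; the decoder regenerates μ from ĥ_2 and adds it back.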
The encoding of ANFIC proceeds by passing the augmented input (x, e_z, e_h) through the autoencoding and hyperprior transforms, i.e. G_π = g^dec_{π_2} ∘ h^dec_{π_3} ∘ h^enc_{π_3} ∘ g^enc_{π_2} ∘ g^dec_{π_1} ∘ g^enc_{π_1}, to obtain the latent representation (x_2, ẑ_2, ĥ_2). In particular, x represents the input image, e_z = 0 denotes the augmented input, and e_h ∼ U(−0.5, 0.5), another augmented input, simulates the additive quantization noise of the hyperprior during training. To achieve lossy compression, we want ẑ_2 and ĥ_2 to capture most of the information about the input x, and we regularize x_2 during training to approximate noughts. As such, only ẑ_2 and ĥ_2 are entropy coded into bitstreams. Note that due to the volume-preserving property of ANF (or any flow model), x_2 has the same dimensionality as the input x, while the dimensionality of ẑ_2 and ĥ_2 is usually much smaller, depending on the design choice. This flexibility allows us to incorporate any existing VAE-based compression scheme as one specific realization of the autoencoding transform in ANFIC. For example, the encoder of any VAE-based compression can be used to implement m^enc_π(x) for the encoding transform in Eq. (2); likewise, its decoder can realize μ^dec_π(z) for the decoding transform in Eq. (3). Note that we have assumed the use of additive coupling layers.
To decode the input x, we apply the inverse mapping G_π^{-1} to the quantized latents (0, ẑ_2, ĥ_2), where x_2 is set to noughts. In ANFIC, there are two sources of distortion that cause the reconstruction to be lossy: the quantization error of z_2 and the error from setting x_2 to noughts during the inverse operation. Essentially, ANFIC is an ANF model, which is bijective and invertible. The errors between the encoding latents (x_2, z_2) and their quantized version (0, ẑ_2) will introduce distortion into the reconstructed image, as shown in Fig. 3b.
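The two distortion sources can be demonstrated with a single additive autoencoding step (toy functions, not the paper's networks): decoding with the true (x_2, z_2) is exact, while zeroing x_2 and rounding z_2 introduces error:

```python
def encode(x, m, mu):
    """(x, e_z = 0) -> (x2, z2) with one additive coupling step."""
    z2 = [m(xi) for xi in x]                       # e_z = 0, so z2 = m(x)
    x2 = [xi - mu(zi) for xi, zi in zip(x, z2)]
    return x2, z2

def decode(x2, z2, mu):
    """Inverse mapping: x = x2 + mu(z2)."""
    return [xi + mu(zi) for xi, zi in zip(x2, z2)]

m = lambda v: 2.0 * v    # toy encoder stand-in
mu = lambda v: 0.4 * v   # toy decoder stand-in

x = [1.2, -0.7]
x2, z2 = encode(x, m, mu)
lossless = decode(x2, z2, mu)                      # exact reconstruction
z2_hat = [float(round(z)) for z in z2]             # quantization error on z2
lossy = decode([0.0] * len(x2), z2_hat, mu)        # plus the zeroed x2
```

The invertibility guarantees that all residual distortion comes from these two controlled approximations, not from the transform itself.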
To mitigate the effect of quantization errors on the decoded image quality, we incorporate a quality enhancement (QE) network at the end of the reverse path, as illustrated in Fig. 3b. This enhancement network is an integral part of ANFIC. Because ANFIC requires the analysis and synthesis transforms to share the same autoencoding transforms (i.e. invertible coupling layers), it is difficult to learn a synthesis transform that can effectively compensate for quantization errors while maintaining invertibility. The same observation was made in [12]. In this paper, we adopt the same lightweight quality enhancement network as [12].
Gaussian Mixtures Extension: ANFIC is flexible in accommodating more sophisticated modeling of p(ẑ_2|ĥ_2), such as Gaussian mixture models. Unlike the single Gaussian model, the mixture model requires estimating the mixing probabilities w^(k), k = 1, 2, …, K for the K components, as well as the corresponding means μ^(k) and variances σ^(k). All these parameters are functions of the hyperprior ĥ_2. In the present case, the decoding transform h^dec_{π_3} (see Eq. (6)) is changed to h^dec_{π_3}(z_2, ĥ_2) = (⌊z_2⌉, ĥ_2) = (ẑ_2, ĥ_2), namely, an identity transform followed by the quantization of z_2. This change is necessary because, with the mixture model, the subtraction of a single predicted mean from z_2 is not feasible. In addition, p(ẑ_2|ĥ_2) follows the mixture of discretized Gaussians

p(ẑ_2 | ĥ_2) = Σ_{k=1}^{K} w^(k) (N(μ^(k), σ^(k)) ∗ U(−0.5, 0.5))(ẑ_2).
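The discretized Gaussian-mixture likelihood can be sketched as below; in ANFIC the mixture parameters would be predicted from ĥ_2, whereas here they are given directly for illustration:

```python
import math

def gmm_likelihood(z_hat, weights, mus, sigmas):
    """P(ẑ2) under a K-component Gaussian mixture, with each
    component integrated over the quantization bin [ẑ2 - 0.5, ẑ2 + 0.5]."""
    def cdf(v, mu, s):
        return 0.5 * (1.0 + math.erf((v - mu) / (s * math.sqrt(2.0))))
    return sum(w * (cdf(z_hat + 0.5, m, s) - cdf(z_hat - 0.5, m, s))
               for w, m, s in zip(weights, mus, sigmas))
```

With K = 1 and weight 1, this reduces to the single-Gaussian hyperprior; extra components let the prior capture multi-modal latent statistics.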

C. Training Objective
Training ANFIC can be achieved by minimizing the negative augmented log-likelihood, i.e. arg min_{π,ψ} −log p_{π,ψ}(x, e_z, e_h):

−log p_{π,ψ}(x, e_z, e_h) = −log p(G_π(x, e_z, e_h)) − log |det (∂G_π(x, e_z, e_h)/∂(x, e_z, e_h))|,

where the Jacobian log-determinant generally prevents the collapse of the latent space. In our implementation, we replace it with a reconstruction loss λ_2 d(x, x̂), with the distortion metric d(·, ·) being the mean-squared error (MSE) or the multi-scale structural similarity index (MS-SSIM):

L(x, e_z, e_h; π, ψ) = λ_1 ||x_2||² + λ_2 d(x, x̂) + E[−log p(ẑ_2 | ĥ_2) − log p(ĥ_2)],   (12)

where π, ψ refer to the parameters of all the networks, including the quality enhancement network. Unlike the traditional weighted sum of rate R and distortion D, our training objective has the additional requirement that x_2 should approximate noughts. This drives ẑ_2, ĥ_2 to encode most of the information about the input x, provided that the reconstructed image x̂ approximates x closely. In passing, we note that the reconstruction loss also prevents the latent space from collapsing: clearly, it would be difficult to recover the input x if different x's were all mapped to the same point in the latent space.
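Putting the pieces together, the overall objective can be sketched as follows (a hypothetical helper: the rate term would come from the learned priors, and the λ names mirror the weak-regularization setting λ_1 = 0.01 × λ_2 described in the paper):

```python
def anfic_loss(x, x_hat, x2, rate_bits, lam1, lam2):
    """Sketch of the training objective:
    L = rate + lam2 * d(x, x_hat) + lam1 * ||x2||^2,
    where the last term weakly drives x2 toward noughts."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)  # distortion d
    reg = sum(v * v for v in x2) / len(x2)                      # x2 regularizer
    return rate_bits + lam2 * mse + lam1 * reg
```

When reconstruction is perfect and x_2 sits at noughts, only the rate term remains, matching the intuition that ẑ_2 and ĥ_2 then carry all the information about x.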

IV. EXPERIMENTAL RESULTS
This section evaluates the performance of ANFIC both objectively and subjectively. We first present the network architectures, training details, evaluation methodologies, and the baseline methods in Section IV-A. Next, we compare the rate-distortion performance of ANFIC with several state-of-the-art methods on commonly used datasets in Section IV-B. Lastly, we evaluate the subjective quality of the reconstructed images in Section IV-C.

A. Settings and Implementation Details
Network Architectures: Our autoencoding transforms for feature extraction (the left branch in Fig. 4) and the hyperprior (the right branch in Fig. 4) share architectures similar to those of the VAE-based scheme in [2]. In addition, we use the same lightweight de-quantization network as in [12] for the quality enhancement network. All the autoencoding transforms in our model have separate network weights. To keep the overall model size comparable to that of [2], we reduce the number of channels in every convolutional layer to 128. We adopt the autoregressive and Gaussian mixture model (Section III-B) for entropy coding in all the experiments, with the number K of mixture components set empirically to 3, which is found to be the most effective setting in [5].
Training: For training, we use the Vimeo-90k dataset [22]. It contains 91,701 training videos, each having 7 frames. In each training iteration, we randomly choose one frame from each sampled video and crop it to 256 × 256. We adopt the Adam optimizer [23] with a batch size of 32. The learning rate is fixed at 1e−4 for the first 3M iterations and then decayed to 1e−5 for fine-tuning. The two hyper-parameters (see Eq. (12)) are chosen such that λ_1 = 0.01 × λ_2, where λ_2 is one of {0.1, 0.05, 0.02, 0.01, 0.005, 0.002} for MSE optimization and one of {200, 100, 40, 20, 10, 4} for optimizing MS-SSIM. In particular, we first train our model for the highest rate point; it is then fine-tuned for a few epochs to obtain the models for the lower rate points.
Evaluation: We evaluate our model on the commonly used Kodak [24] and Tecnick [25] datasets, which include 24 uncompressed images of size 768 × 512 and 40 images of size 1200 × 1200, respectively. Additionally, we test our model on the CLIC validation datasets [26], which comprise two subsets: professional, with 41 higher-resolution images, and mobile, with 61 images. To evaluate rate-distortion performance, we report rates in bits per pixel (bpp) and quality in PSNR-RGB and MS-SSIM. Moreover, we use BPG as the anchor in reporting BD-rates; rate inflation relative to BPG appears as positive BD-rates, while rate saving appears as negative BD-rates.
Baselines: For comparison, the baseline methods include VTM-444, BPG-444, ICLR'18 [8], NIPS'18 [2], ICLR'19 [9], TPAMI'20 [12], CVPR'20 [5], TPAMI'21 [10], and TIP'21 [4]. It is worth noting that TPAMI'20 [12] is a flow-based model, while the other learned codecs are VAE-based.

B. Rate-Distortion Performance
The BD-rate results are summarized in Table I. Following some prior works, the BD-rate figures for the CLIC professional dataset are reported separately in Table I. In terms of PSNR-RGB, one can see that our method shows performance comparable to the state-of-the-art learned codecs, CVPR'20 [5] and TPAMI'20 [12], on the Kodak and CLIC datasets. Remarkably, it achieves the best performance among all the learned codecs on the Tecnick and CLIC datasets. It however falls slightly short of the VTM model on the Kodak, Tecnick, and CLIC datasets. In particular, ANFIC displays a tendency to perform worse at low rates. This may be attributed to the fact that additive coupling layers are susceptible to the accumulation and propagation of quantization errors (Fig. 3b). It is important to note in Table I that ANFIC is inferior to VTM in BD-rate saving by a significant margin (7%) on the CLIC professional dataset. Careful examination of the dataset reveals that some of its images are extremely challenging and not typical of the images found in our training data; all the competing methods face the same issue, and increasing the diversity of the training data is expected to help. Nevertheless, the superiority of ANFIC over BPG is apparent on all the datasets.
In terms of MS-SSIM, our method performs among the top two. It is slightly worse than the top performer, CVPR'20 [5], on the Kodak dataset, especially at low rates (see Fig. 5b), but is comparable to ICLR'19 [9], which achieves the best MS-SSIM performance on the CLIC dataset. It is worth mentioning that TPAMI'20 [12], a strong baseline when evaluated with PSNR-RGB, exhibits poor MS-SSIM results because the released model is optimized for MSE only. Also, as noted in other studies, the learned codecs outperform VTM and BPG considerably when trained and tested with MS-SSIM.
The model size comparison in Table I suggests that the rate-distortion benefits of ANFIC do not come at the expense of unreasonably large models. Its model size is between those of TPAMI'20 [12] and CVPR'20 [5], both of which show competitive rate-distortion performance.

C. Subjective Quality Comparison
Figs. 6 and 7 show the subjective quality comparison between ANFIC (ours), VVC, BPG, and TPAMI'20 [12] on images kodim01 and kodim16 from the Kodak dataset. It is seen that our MSE model achieves subjective quality comparable to VVC and TPAMI'20 [12]. As expected, ANFIC optimized for MSE tends to smooth highly-textured areas, while VVC and BPG generate clear blocking artifacts in Fig. 7. In particular, TPAMI'20 [12] suffers from geometric distortion, especially in the "door" area in Fig. 6, and produces artificial noisy dots on the "water surface" in Fig. 7. In contrast, our MS-SSIM model shows much better subjective quality, preserving most high-frequency details.

V. ABLATION STUDIES
In this section, we conduct ablation studies to understand ANFIC's properties. First, we show how the ANF framework improves the VAE-based scheme by stacking its autoencoding transform (Section V-A). Second, we investigate the effect of the quality enhancement network on ANFIC and its VAE-based counterpart (Section V-B). Third, we discuss the effect of imposing different regularization strategies on x_2 (Section V-C). Fourth, we analyze the inner workings of ANFIC by visualizing the output of each autoencoding transform in both the spatial and frequency domains (Section V-D). Fifth, we study the compression performance of ANFIC across low and high rates (Section V-E). Finally, we extend ANFIC to support variable-rate compression and compare its performance with the other baselines (Section V-F). Unless otherwise specified, the Kodak dataset is used for the ablation experiments.

A. Number of Autoencoding Transforms
To see the rate-distortion benefits of stacking autoencoding transforms, we compare the VAE-based scheme [2] with ANFIC under a varied number of autoencoding transforms. It is important to note that the VAE-based scheme can be interpreted as one-step ANFIC (see Section III-A). For a fair comparison, the VAE-based scheme (termed "NIPS'18+GMM", modified from [2] by additionally including Gaussian mixture-based entropy coding and the quality enhancement network [12]) and ANFIC share the same autoencoding architecture, entropy coding scheme, and quality enhancement network. To keep the model size comparable, the channel number of every autoencoding transform in ANFIC is set to 128 (see Fig. 4), while that of the VAE-based counterpart is 192. This ensures that ANFIC with two autoencoding transforms (the main setting used throughout this paper) has a model size similar to the VAE-based one. Nevertheless, when the number of autoencoding transforms increases beyond two, the model size of ANFIC grows linearly.
From Fig. 8, it is seen that increasing the number of autoencoding transforms from one (VAE-based) to two (ours, 2-step) improves the rate-distortion performance significantly. However, the gain diminishes sharply when the number goes beyond two. We thus choose two autoencoding transforms as our default setting.
A side experiment shows that increasing the channel number (i.e. the L value in Fig. 4) of the autoencoding transform from 128 to 192 improves the BD-rate saving only marginally, by 1.1%. The channel number thus defaults to 128 for lower complexity and fair comparison.

B. Effect of Quality Enhancement Networks
Fig. 9 shows the effect of the quality enhancement network (as a post-processing network) on the rate-distortion performance of ANFIC and the VAE-based scheme [2]. In addition to the default quality enhancement network from [12], we experiment with another popular one, known as MCNet [1], which is often used in end-to-end learned video codecs to enhance the quality of the motion-compensated frame [1]. The two quality enhancement networks have similar model sizes. The major difference between them is that the default one [12] has no striding or pooling operations, whereas MCNet [1] has a U-net structure, in which the resolution of the feature maps first shrinks and later stretches.
We observe that ANFIC benefits more from the use of the default quality enhancement network [12], which boosts the BD-rate saving of ANFIC by 6.6%, as compared to 3.5% for the NIPS'18+GMM scheme [2] (VAE-based, with the default quality enhancement network [12]). This suggests that ANFIC effectively separates image transformation and (quantization) error compensation into two orthogonal parts: the former is handled by the invertible autoencoding transforms, while the latter relies on the quality enhancement network. The fact that the feature extraction and the image reconstruction in ANFIC have to go through the same invertible coupling layers makes it difficult to learn autoencoding transforms that handle both image representation and error compensation well. This however is not the case with the NIPS'18+GMM (VAE-based) scheme, where the analysis and synthesis transforms do not share the same network. Usually, the synthesis transform can learn to compensate partially for quantization errors; as such, the gain from the quality enhancement network becomes limited when the synthesis network is already capable enough.
Fig. 9: Rate-distortion performance with and without the quality enhancement network.
From Fig. 9, it is also seen that the default quality enhancement network [12] shows better rate-distortion performance than MCNet [1], especially at lower rates. This may be attributed to the fact that the striding and pooling of MCNet [1] could cause the loss of some spatial information. In any case, ANFIC with either quality enhancement network outperforms NIPS'18+GMM.

C. Effect of x 2 Regularization

Fig. 10 compares the rate-distortion curves for different regularization strategies imposed on x 2 , including (1) weak regularization (λ 1 = 0.01 × λ 2 ) with the L 2 norm (the proposed method), (2) weak regularization (λ 1 = 0.01 × λ 2 ) with the L 1 norm, (3) heavy regularization (λ 1 = 1 × λ 2 ) with the L 2 norm, and (4) no regularization (λ 1 = 0). Weak regularization with either the L 2 or the L 1 norm achieves the best rate-distortion performance, presenting 15.3% BD-rate reductions. Heavy regularization with the L 2 norm, however, degrades the rate-distortion performance, because the regularization loss is then weighted equally with the reconstruction loss. No regularization, interestingly, shows only marginally worse rate-distortion performance (15.2% BD-rate reduction) than weak regularization with the L 2 norm (the proposed method).
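The regularization strategies compared here amount to adding a weighted penalty on x 2 to the rate-distortion objective. A minimal numpy sketch (the exact form of Eq. (12) is not reproduced; the function and weights below are illustrative):

```python
import numpy as np

def anfic_loss(x, x_hat, x2, rate_bits, lam2, lam1_scale=0.01, norm="l2"):
    """Rate-distortion objective with a weak penalty pushing the
    residual image x2 toward zero. lam1 = lam1_scale * lam2, so
    lam1_scale=0.01 is 'weak' and 1.0 is 'heavy' regularization."""
    distortion = np.mean((x - x_hat) ** 2)   # d(x, x_hat) as MSE
    lam1 = lam1_scale * lam2
    if norm == "l2":
        reg = np.mean(x2 ** 2)
    else:                                    # "l1"
        reg = np.mean(np.abs(x2))
    return rate_bits + lam2 * distortion + lam1 * reg

x = np.zeros((8, 8))
x_hat = np.full((8, 8), 0.5)
loss = anfic_loss(x, x_hat, x2=np.zeros((8, 8)), rate_bits=1.0, lam2=0.01)
print(loss)  # 1.0 + 0.01 * 0.25 = 1.0025
```

With x2 identically zero, the regularization term vanishes and the loss reduces to the usual rate plus weighted distortion.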
The fact that no regularization has only marginal impact on the final rate-distortion performance has partially to do with our setting x 2 to 0 for reconstruction during training. Recall that the mapping between the input (x, e z , e h ) and the latent representation (x 2 , z 2 , h 2 ) is invertible (see Fig. 3a). In the absence of quantization, using (0, z 2 , h 2 ) in place of (x 2 , z 2 , h 2 ) for decoding, while ensuring invertibility by minimizing the reconstruction loss d(x, x̂), would compel x 2 to approximate zero during encoding without any additional regularization. The same trend carries over roughly to the case where x 2 , z 2 , h 2 are quantized. We notice, however, that imposing weak regularization on x 2 during encoding makes the training more stable.

D. Visualization of Autoencoding Transforms

This experiment visualizes how the image x evolves through the autoencoding transforms and what information is captured by the corresponding latent code z i , i = 1, 2, in each step. Additionally, the corresponding signal spectra in the frequency domain are presented to understand the system response of every autoencoding transform. For better visualizing the evolution of signals, we extend the architecture in Fig. 4 to a three-step ANFIC, with the final outputs being x 3 and z 3 (instead of x 2 and z 2 as depicted in Fig. 4). Also presented are the decoder outputs μ^dec_πi , i = 1, 2, 3, of the autoencoding transforms (see Eq. (3) and Fig. 4), which reveal the information captured by the latent codes z i . As an example, the first autoencoding transform converts the image x into the latent code z 1 , which is then decoded as μ^dec_π1 and subtracted from x. Hence, μ^dec_π1 stands for an estimate of x derived from the latent z 1 .
From left to right in the top two rows, one can see that the high-frequency details of the input image x are filtered out by successive autoencoding transforms, arriving at a residual image x 3 with little high-frequency information (see the subfigure in the top-right corner). As such, the autoencoding transforms in ANFIC act as low-pass filters whose cut-off frequency decreases with the increasing transform step in the feature extraction process. Because x 3 is discarded during the reconstruction process, the remaining high-frequency details in x 3 are lost completely. Thus, ANFIC is lossy.
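The low-pass behavior described above can be quantified by the fraction of spectral energy above a radial cut-off frequency. A small numpy sketch (the probe function and the box blur standing in for one transform step are hypothetical, not the paper's transforms):

```python
import numpy as np

def highfreq_energy_fraction(img, cutoff=0.25):
    """Fraction of 2D spectral energy at radial frequencies above
    `cutoff` (in cycles/sample): a simple probe of low-pass behavior."""
    F = np.fft.fftshift(np.fft.fft2(img))
    fy = np.fft.fftshift(np.fft.fftfreq(img.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(img.shape[1]))
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    energy = np.abs(F) ** 2
    return energy[r > cutoff].sum() / energy.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
# A 3x3 box blur as a crude stand-in for one low-pass transform step
blur = sum(np.roll(np.roll(x, dy, 0), dx, 1)
           for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
print(highfreq_energy_fraction(x), highfreq_energy_fraction(blur))
```

For a white-noise input, each smoothing step visibly lowers the high-frequency energy fraction, mirroring the decreasing cut-off frequency observed across the transform steps.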
The decoder outputs of the autoencoding transforms further shed light on how the latent code is transformed from e z into a form suitable for compression (i.e. e z → z 1 → z 2 → z 3 ). From left to right in the bottom two rows, we see that μ^dec_π1 (decoded from z 1 ) presents a rough estimate of the input x. Its spectrum looks similar to that of x, but is not exactly the same. We conjecture that μ^dec_π1 focuses more on approximating the high-frequency part of the input x. The corroborating fact is that when it is subtracted from x, the resulting output x 1 = x − μ^dec_π1 has relatively less high-frequency information. This becomes even more obvious in the following autoencoding transform, where μ^dec_π2 (decoded from z 2 ) addresses primarily the remaining mid-frequency part of x 1 ; as a result, the output x 2 = x 1 − μ^dec_π2 of the second transform becomes an even lower-frequency signal. In the end, the latent code z 3 , which will be compressed into the bitstream, only needs to represent a low-pass filtered version of the original input, which is relatively easy to compress. The reconstruction process updates a zero image in x 3 with those decoder outputs in reverse order (i.e. μ^dec_π3 → μ^dec_π2 → μ^dec_π1 → x), recovering the low-frequency, mid-frequency, and high-frequency details of the input x step by step.
Fig. 12 further visualizes x 2 , z 2 , and ẑ 2 at different bit rates ranging from 0.135 bpp to 1.194 bpp. It is seen that more residuals appear in x 2 at low rates than at high rates, suggesting that setting x 2 to a zero image at low rates would introduce more distortion than at high rates. As for z 2 and ẑ 2 , because a fixed, uniform quantization step size, i.e. 1, is used for all the rate points, the MSE between z 2 and ẑ 2 does not change significantly. However, the network learns to adjust the variance of z 2 in order to control the signal-to-noise ratio in the latent space.
We see that the higher the bit rate is, the more information is captured by z 2 ; namely, z 2 tends to have larger variances at high rates. All in all, the information captured by x 2 decreases with the increasing bit rate, whereas that by z 2 increases accordingly.
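The observation above can be checked numerically: with a fixed unit quantization step, the quantization MSE stays near 1/12 regardless of the latent's variance, so increasing the variance of z 2 directly raises the signal-to-noise ratio. A sketch using rounding as a stand-in for the codec's quantizer:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmas = (1.0, 2.0, 8.0)   # stand-ins for z2's standard deviation
mses, snrs = [], []
for sigma in sigmas:
    z = rng.normal(0.0, sigma, 100_000)
    z_hat = np.round(z)                 # fixed uniform step size of 1
    mse = np.mean((z - z_hat) ** 2)     # stays near 1/12 ~ 0.083
    mses.append(mse)
    snrs.append(sigma ** 2 / mse)       # grows with the variance
    print(f"sigma={sigma}: mse={mse:.4f}, snr={sigma ** 2 / mse:.1f}")
```

The MSE barely moves across the three settings, while the SNR scales roughly with sigma squared, matching the behavior of z 2 across bit rates.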

E. Compression Performance across Low and High Rates
This study investigates the compression performance of ANFIC over a wide range of bit rates. It is reported in [11], [12] that most VAE-based compression schemes suffer from the autoencoder limitation; that is, the reconstruction by the autoencoder is generally lossy, even without quantization. As a result, it is difficult for a VAE-based model to support efficient compression over a wide range of bit rates without changing the network architecture, for example, by adjusting the number of channels. ANFIC, although being a flow-based model, is also lossy, due to discarding the high-frequency information in the residual image x 2 (see Fig. 4) for reconstruction. Fig. 13 compares ANFIC with two state-of-the-art VAE-based schemes over a wide range of bit rates. In particular, ANFIC has the same number of channels (i.e. 320 channels) in the latent space as NIPS'18 [2], whereas CVPR'20 [5] has only 192 channels yet a larger model size. We see that our ANFIC (without transmitting x̂ 2 ) matches the performance of VTM closely, from extremely low-rate compression up to perceptually lossless compression, while the two VAE-based schemes tend to fall short of VTM and even BPG at high rates. The reason why ANFIC is able to work well across low and high rates is two-fold: (1) the ANF-based backbone is fully invertible, and (2) our training strategies, which require x 2 to approximate zero in the feature extraction process and use exactly zero for x 2 during reconstruction, force the image latent ẑ 2 and its hyperprior ĥ 2 to capture as much information about the input x as possible (see Fig. 4).
To further study the invertibility of ANFIC by additionally encoding x 2 , we model the distribution of the quantized x 2 , denoted by x̂ 2 (the quantized version of x 1 − μ^dec_π2 (ẑ 2 )), by the convolution of a Gaussian and a uniform distribution. For better coding efficiency, the distribution is conditioned on ẑ 2 :

p(x̂ 2 | ẑ 2 ) = N(0, σ^dec_π2 (ẑ 2 )²) * U(−0.5, 0.5).

A closer look at the rate-distortion performance with and without transmitting x̂ 2 in Fig. 13 reveals that (1) at lower rates, transmitting x̂ 2 shows worse rate-distortion performance than not transmitting x̂ 2 , and that (2) at higher rates, transmitting x̂ 2 helps close the quality gap between lossy and (mathematically) lossless compression. In particular, not transmitting x̂ 2 puts a limit on the highest achievable reconstruction quality (i.e. the rate-distortion curve plateaus after 6 bpp). The second observation is in line with the invertibility property of ANF. Focusing on lossy image compression, we opt for not transmitting x̂ 2 in this paper. However, how to adapt ANFIC to support mathematically lossless coding is an interesting open issue that we leave as future work.
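The Gaussian-convolved-with-uniform model above assigns each quantized symbol the Gaussian probability mass of its unit-width bin, which directly gives the coding cost in bits. A stdlib-only sketch (function names are ours; the zero-mean Gaussian with sigma predicted from ẑ 2 follows the equation above):

```python
import math

def gauss_uniform_pmf(x_hat, sigma, mu=0.0):
    """P(x_hat) under N(mu, sigma^2) * U(-0.5, 0.5): the Gaussian
    mass falling in the unit-width bin centered at x_hat."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
    return cdf(x_hat + 0.5) - cdf(x_hat - 0.5)

def rate_bits(x_hat, sigma):
    """Ideal entropy-coding cost of one symbol, in bits."""
    return -math.log2(max(gauss_uniform_pmf(x_hat, sigma), 1e-12))

# Symbols far from the mean (relative to sigma) cost more bits
print(rate_bits(0, 1.0), rate_bits(3, 1.0))
```

The per-symbol bin probabilities sum to one over the integers, so the model is a valid PMF for the arithmetic coder.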

F. Variable Rate Compression
Recognizing that ANFIC works well over a wide range of bit rates, we take one step further and adapt ANFIC to variable rate compression with a single model. To this end, we adopt the conditional convolution of [17], replacing every convolutional layer with one conditioned on λ 2 (see Eq. (12)). The conditional convolution layer applies an affine transformation to every feature map, with the affine parameters derived from a network conditioned on the rate parameter λ 2 . For the experiment, we train a single ANFIC model using 5 distinct λ 2 values {0.1, 0.05, 0.02, 0.01, 0.005}. The training objective extends Eq. (12) by substituting the different λ 2 's into Eq. (12) and averaging over these variants. Fig. 14 shows the rate-distortion comparison of the state-of-the-art variable rate models, including VVC, BPG, ICCV'19 [17], TPAMI'20 [12], and ANFIC (ours). Compared with our multi-model setting, the single-model setting performs comparably well, with a slightly higher rate saving that we attribute to training variance. It also shows comparable performance to VTM across the 5 rate points, while significantly outperforming the other learning-based methods in the single-model setting.
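The conditional convolution idea reduces to a channel-wise affine modulation of each layer's feature maps, with scale and shift predicted from the rate parameter. A minimal numpy sketch (the linear conditioning network and all shapes are hypothetical stand-ins for the learned one in [17]):

```python
import numpy as np

def conditional_affine(features, lam, w_scale, w_shift):
    """Channel-wise affine modulation for variable-rate coding: per-channel
    scale/shift come from a tiny (here: linear) network conditioned on the
    rate parameter lambda."""
    cond = np.array([np.log(lam), 1.0])        # 2-dim conditioning vector
    scale = np.exp(w_scale @ cond)             # (C,) positive gains
    shift = w_shift @ cond                     # (C,) offsets
    return features * scale[:, None, None] + shift[:, None, None]

C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((C, H, W))
w_scale = rng.standard_normal((C, 2)) * 0.1    # illustrative weights
w_shift = rng.standard_normal((C, 2)) * 0.1
out_lo = conditional_affine(feats, 0.005, w_scale, w_shift)  # low-rate lambda
out_hi = conditional_affine(feats, 0.1, w_scale, w_shift)    # high-rate lambda
print(out_lo.shape)  # features keep their shape; only the modulation changes
```

Because only the tiny conditioning weights depend on λ 2 , one set of convolutional kernels serves all rate points, which is what enables the single-model setting.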
VI. CONCLUSION

In this paper, we propose an ANF-based image compression system (ANFIC). It is motivated by the fact that VAE, which forms the basis of most end-to-end learned image compression, is a special case of ANF and can be extended by ANF to offer greater expressiveness. ANFIC is the first work to introduce VAE-based compression into a flow-based framework, enjoying the benefits of both approaches. Experimental results show that ANFIC performs comparably to or better than the state-of-the-art learned image compression and is able to offer a wide range of quality levels without changing the network architecture. Furthermore, its variable rate version shows little performance degradation. Flow-based models are relatively new to learned image compression, and we believe there remains wide open space for further research; for example, how to achieve mathematically lossless coding with ANFIC is yet to be addressed.