DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks

The field of video compression has developed some of the most sophisticated and efficient compression algorithms known in the literature, enabling very high compressibility for little loss of information. Whilst some of these techniques are domain specific, many of their underlying principles are universal in that they can be adapted and applied for compressing different types of data. In this work we present DeepCABAC, a compression algorithm for deep neural networks that is based on one of the state-of-the-art video coding techniques. Concretely, it applies a Context-based Adaptive Binary Arithmetic Coder (CABAC) to the network's parameters. CABAC was originally designed for the H.264/AVC video coding standard and became the state of the art for lossless compression. Moreover, DeepCABAC employs a novel quantization scheme that minimizes the rate-distortion function while simultaneously taking the impact of quantization on the accuracy of the network into account. Experimental results show that DeepCABAC consistently attains higher compression rates than previously proposed coding techniques for neural network compression. For instance, it is able to compress the VGG16 ImageNet model by a factor of 63.6 with no loss of accuracy, thus representing the entire network with merely 8.7 MB. The source code for encoding and decoding can be found at https://github.com/fraunhoferhhi/DeepCABAC.


I. INTRODUCTION
It has been well established that deep neural networks excel at solving many complex machine learning tasks [1], [2]. Their relatively recent success can be attributed to three phenomena: 1) access to large amounts of data, 2) novel optimization algorithms and model architectures that allow training very deep neural networks, 3) the increasing availability of compute resources [1]. In particular, the latter two allowed machine learning practitioners to equip neural networks with an ever-growing number of layers and, consequently, to consistently attain state-of-the-art results on a wide spectrum of complex machine learning tasks.
However, this has triggered an exponential growth in the number of parameters these models entail over the past years [3]. Trivially, this implies that the models are becoming more and more complex in terms of memory. This can become very problematic since it does not only imply higher memory requirements, but also slower runtimes and high energy consumption [4]. In fact, IO operations can be up to three orders of magnitude more expensive than arithmetic operations. Moreover, [3] show that the memory-energy efficiency trends of most common hardware platforms are not able to keep up with the exponential growth of the neural networks' sizes, so that these models are expected to become more and more power hungry over time.
In addition, there has also been an increasing demand for deploying deep models to resource-constrained devices such as mobile or wearable devices [5]-[7], as well as for training deep neural networks in a distributed setting such as in federated learning [8]-[10], since these approaches have direct advantages with regard to privacy, latency and efficiency. High memory complexity greatly complicates the applicability of neural networks for those use cases, in particular for the federated learning case since the parameters of the networks are transmitted through communication channels with limited capacity.
Model compression is one possible paradigm to solve this problem. Namely, by attempting to maximally compress the information contained in the network's parameters we automatically leave only the bits that are necessary for solving the task. Thus, in principle, the memory complexity of deep models should only increase with the complexity of the learning task and not with its number of parameters. In addition, model compression has direct practical advantages such as reduced communication and compute cost [12]-[14]. In fact, the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) has recently issued a call on neural network compression [15], which stresses the relevance of the problem and the broad interest of the industry in finding practical solutions.

A. Entropy coding in video compression
The topic of signal compression has long been studied, and highly practical and efficient algorithms have been designed. State-of-the-art video compression schemes like H.265/HEVC [16] employ efficient entropy coding techniques that can also be used for compressing deep neural networks. Namely, the Context-based Adaptive Binary Arithmetic Coding (CABAC) engine [17] provides a very flexible interface for entropy coding that can be adapted to a wide range of applications. It is optimized to allow high throughput and a high compression ratio at the same time. In particular, the transform coefficient coding part of H.265/HEVC contains many interesting aspects that might be suitable for compressing deep neural networks.
Hence, it appears only natural to try to adapt current state-of-the-art compression techniques such as CABAC to deep neural networks and compress them accordingly.

B. Contributions
Our contributions can be summarized as follows: 1) We adapt CABAC for the task of neural network compression. To the best of our knowledge, we are the first to apply state-of-the-art coding techniques from video compression to deep neural networks. 2) We quantize the parameters of the networks by minimizing a generalized form of a rate-distortion function which takes the impact of quantization on the accuracy of the network into account. 3) In our experiments we show that DeepCABAC is able to attain very high compression ratios and that it consistently attains a higher compression performance than previously proposed coders.

C. Outline
In section II we start by reviewing some basic concepts from information theory, in particular from source coding theory. We also highlight the main difference between the classical source coding and the model compression paradigms in subsection II-D. Subsequently, we proceed by explaining DeepCABAC in section III. In section IV we provide a comprehensive review of the related work on neural network compression. Finally, we provide experimental results and a respective discussion in section V.

II. SOURCE CODING
Source coding is a subfield of information theory that studies the properties of so-called codes. These are mappings that assign a binary representation and a reconstruction value to a given input element. Figure 1 depicts their most common structure. They are comprised of two parts, an encoder and a decoder. The encoder is a mapping that assigns a binary string $b$ of finite length to an input element $w$. In contrast, the decoder assigns a reconstruction value $q$ to the corresponding binary representation. We will also sometimes refer to $q$ as a quantization point. Furthermore, it is assumed that the output elements $b$ and $q$ of the code $C$ are elements of finite countable sets, and that there is a one-to-one correspondence between them. Therefore, without loss of generality, we can decompose the encoder into a quantizer and a binarizer, where the former maps the input to an integer value $Q(w) = i \in \mathbb{Z}$, and the latter maps the integers to their corresponding binary representation $B(i) = b$. The decoder is decomposed analogously. Naturally, it follows that the binarizer is always a bijective map, thus
$$(B^{-1} \circ B)(i) = i.$$
Fig. 1: The general structure of codes. Firstly, the encoder maps an input sample $w$ from a probability source $P(w)$ to a binary representation $b$ by a two-step process. It quantizes the input by mapping it to an integer $i = Q(w)$. Then, the integer is mapped to its corresponding binary representation $b = B(i)$ by applying a binarization process. The decoder functions analogously: it maps the binary representation back to its integer value by applying the inverse $B^{-1}(b) = i$, and subsequently it assigns a reconstruction value (or quantization point) $Q^{-1}(i) = q$ to it. We stress that $Q^{-1}$ does not have to be the inverse of $Q$.
We also distinguish between two types of codes, the so-called lossless codes and lossy codes. They respectively correspond to the cases where $Q$ is either bijective or not; the latter implies that information is lost in the coding process. Therefore, we stress that the map $Q^{-1}$ does not necessarily have to be the inverse of $Q$.
After establishing the basic definition of codes, we will now formalize the source coding problem. In simple terms, source coding studies the problem of finding the code that maximally compresses a set of input samples, while maintaining the error between the input and reconstruction values under an error tolerance constraint. Notice that the problem is probabilistic in its nature since it implicitly assumes that the decoder has no access to the element values being encoded. Moreover, the input values themselves may come from an unknown source distribution. Hence, we denote with $P_{Enc}(w)$ the encoder's probability model of $w$, and with $P_{Dec}(q)$ ($\equiv P_{Dec}(b) \equiv P_{Dec}(i)$) the decoder's probability model of $q$ (or equivalently $b$ and $i$). It is important to stress that both models do not have to coincide, thus in general $P_{Enc}(Q(w)) = P_{Enc}(i) \not\equiv P_{Dec}(i)$. Furthermore, we will assume that the encoder's probability model follows the true underlying distribution of the input source, and therefore we will simply write $P_{Enc}(w) \equiv P(w)$.
Thus, the source coding problem can be formulated more precisely as follows: let $\mathcal{W} \subset \mathbb{R}^n$ be a given input set and let $P(w)$ be the probability of an element $w \in \mathcal{W}$ being sampled. Then, find a code $C^*$ that satisfies
$$C^* = \arg\min_{C} \; \mathbb{E}_{P(w)}\left[ D(w, q) + \lambda L_C(b) \right], \tag{1}$$
where $b = (B \circ Q)(w)$ and $q = (Q^{-1} \circ Q)(w)$. $D$ is some distance measure and $L_C$ the length of the binary representation $b$. We will sometimes refer to $L_C(\cdot)$ as the code-length of a sample, and to $D$ as the distortion between $w$ and $q$. $\mathbb{E}_P[\cdot]$ denotes the expectation taken with respect to the probability distribution $P$. $\lambda \in \mathbb{R}$ is the Lagrange multiplier that controls the trade-off between the compression strength and the error incurred by it. Minimization objectives of the form (1) are called rate-distortion objectives in the source coding literature. Solving the rate-distortion objective for a given input source is most often NP-hard, since it involves finding optimal quantizers $Q$, binarizers $B$ and reconstruction values $Q^{-1}$ from the space of all possible maps. However, concrete solutions can be found for special cases, in particular in the lossless case. In the following we will review some of the fundamental theorems of source coding theory and introduce state-of-the-art coding algorithms that produce binary representations with minimal redundancy.
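To make the trade-off in the rate-distortion objective concrete, the following minimal sketch evaluates the empirical cost $D(w, q) + \lambda L_C(b)$ for a toy code. It assumes a uniform scalar quantizer with a hypothetical step size `step`, the squared error as distortion, and a fixed-length binarizer; none of these choices are prescribed by the text, they only serve as an illustration.

```python
import math

def quantize(w, step):
    """Quantizer Q: map a real value to an integer index."""
    return round(w / step)

def reconstruct(i, step):
    """Reconstruction map Q^{-1}: map an integer index to a quantization point."""
    return i * step

def rd_cost(samples, step, lam):
    """Empirical rate-distortion cost: mean squared error + lam * bits per sample."""
    indices = [quantize(w, step) for w in samples]
    # Fixed-length binarizer B: every index is written with the same number of bits.
    n_levels = max(indices) - min(indices) + 1
    bits = max(1, math.ceil(math.log2(n_levels)))
    distortion = sum((w - reconstruct(i, step)) ** 2
                     for w, i in zip(samples, indices)) / len(samples)
    return distortion + lam * bits, distortion, bits

samples = [-0.9, -0.4, -0.1, 0.0, 0.05, 0.3, 0.7, 1.2]
for step in (0.1, 0.5, 1.0):
    cost, dist, bits = rd_cost(samples, step, lam=0.01)
    print(f"step={step}: distortion={dist:.4f}, bits/sample={bits}, cost={cost:.4f}")
```

A coarser step lowers the number of bits per sample but raises the distortion; the multiplier `lam` decides which of the candidate codes attains the lowest cost.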

A. Lossless coding
Lossless coding implies that $q = (Q^{-1} \circ Q)(w) = w \;\; \forall w$. Thus, $D(w, q) = 0 \;\; \forall w \in \mathcal{W}$ in (1) and the rate-distortion objective simplifies into finding a binarizer $B^*$ that maximally compresses the input samples. Hence, throughout this section we will equate the general code $C$ with the binarizer $B$ and refer to it accordingly. Moreover, we will also assume that the decoder's probability model equals the encoder's, thus $P_{Enc} = P_{Dec}$. In the next subsection we discuss the case when the latter property does not apply.
Information theory already makes concrete statements regarding the minimum information contained in a probability source. Namely, Shannon in his influential work [18] stated that the minimum information required to fully represent a sample $w$ that has probability $P(w)$ is $-\log_2 P(w)$ bits. Consequently, the entropy $H_P(\mathcal{W}) = \sum_{w \in \mathcal{W}} -P(w) \log_2 P(w)$ states the minimum average number of bits required to represent any element $w \in \mathcal{W} \subset \mathbb{R}^n$. This implies that
$$H_P(\mathcal{W}) \le \bar{L}_C(\mathcal{W}), \tag{2}$$
where $\bar{L}_C(\mathcal{W}) = \sum_{w \in \mathcal{W}} P(w) L_C(w)$ is the average code-length that any code $C$ assigns to the elements $w \in \mathcal{W}$. Eq. (2) is also referred to as the fundamental theorem of lossless coding.
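As a small numerical illustration of the entropy bound, the sketch below computes the entropy of a toy source and the average code-length of two hand-written codes. The probability mass function and both codes are illustrative assumptions, not taken from the experiments in this work.

```python
import math

def entropy_bits(pmf):
    """Shannon entropy in bits of a pmf given as {symbol: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def average_code_length(pmf, code):
    """Average code-length: sum over w of P(w) * len(code[w])."""
    return sum(p * len(code[w]) for w, p in pmf.items())

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # variable length
fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}     # fixed length

print("entropy      :", entropy_bits(pmf))
print("prefix code  :", average_code_length(pmf, prefix_code))
print("fixed-length :", average_code_length(pmf, fixed_code))
```

For this dyadic source the variable-length code meets the entropy exactly, whereas the fixed-length code spends more bits on average; no lossless code can fall below the entropy.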
Fortunately, from the source coding literature [19] we know of the existence of codes whose average code-length lies within 1 bit of the theoretical minimum. That is,
$$H_P(\mathcal{W}) \le \bar{L}_C(\mathcal{W}) < H_P(\mathcal{W}) + 1. \tag{3}$$
Moreover, we even know how to build them. Before we start discussing some of these codes in more detail, we want to recall an important property of joint probability distributions. Namely, due to their sequential decomposition property, we can express the minimal information entailed in the output sample $w \in \mathbb{R}^n$ of a joint probability distribution $P(w)$ sequentially as
$$-\log_2 P(w) = -\log_2 \prod_{j=0}^{n-1} P(w_j \,|\, w_{j-1}, \dots, w_0) = \sum_{j=0}^{n-1} -\log_2 P(w_j \,|\, w_{j-1}, \dots, w_0).$$
That is, we can always interpret a given input vector as an n-long random process and encode its outputs sequentially. As long as we know the respective conditional probability distributions, we can optimally encode the entire sequence. Accordingly, we denote with $w_j$ the scalar value of the j-th dimension of $w$ (or equivalently the j-th output of the random process). Also, we denote with $\mathcal{W}_s$ the set of possible scalar inputs, where $w_j \in \mathcal{W}_s \;\; \forall j$.
1) (scalar) Huffman coding: One optimal code is the well-known Huffman code [20]. It consists of building a binary tree such that each input sample $w$ is associated with one of the leaves of the tree. Thus, each $w$ can be associated with the sequence of binary decisions that traverse the tree from its root. The main idea is then to build the tree in such a manner that shorter paths are associated with more probable samples $w$. Huffman proved that this code satisfies (3). We provide pseudocode of the encoding and decoding process in the appendix (see algorithms 3, 1, and 2).
However, Huffman codes can be very inefficient in practice since the Huffman tree grows very quickly for large input dimensions n. Therefore, most often scalar Huffman codes are used instead. Scalar Huffman codes only consider 1-dimensional inputs and accordingly encode each sample of the n-long random process separately. However, these codes are suboptimal in that they produce redundant binary representations and therefore do not satisfy (3). Concretely, they produce average code-lengths of
$$H_P(\mathcal{W}) \le \sum_{j=0}^{n-1} H_P(W_j) \le \bar{L}_{SH}(\mathcal{W}) < \sum_{j=0}^{n-1} H_P(W_j) + n, \qquad H_P(W_j) = \sum_{w_j \in \mathcal{W}_s} -P(w_j) \log_2 P(w_j),$$
where now $P(w_j)$ is the probability of a scalar output $w_j$ and $\bar{L}_{SH}(\cdot)$ is the average code-length produced by the scalar Huffman code. Moreover, they are limited to stationary processes since they do not take conditional dependencies into account, which could further reduce the average code-length.
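The tree construction described above can be sketched in a few lines. The following is a minimal scalar Huffman coder built on Python's `heapq`; it is an illustrative sketch and not the pseudocode of the appendix, and the toy probability mass function is an assumption. It checks numerically that the average code-length lies within one bit of the entropy.

```python
import heapq
import math
from itertools import count

def huffman_code(pmf):
    """Build a Huffman code {symbol: bitstring} for a pmf {symbol: probability}."""
    tiebreak = count()  # unique counter so that the heap never compares dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in pmf.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in pmf}
    while len(heap) > 1:
        # Merge the two least probable subtrees under a new parent node.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {"a": 0.6, "b": 0.2, "c": 0.1, "d": 0.1}
code = huffman_code(pmf)
avg_len = sum(p * len(code[s]) for s, p in pmf.items())
entropy = -sum(p * math.log2(p) for p in pmf.values())
print(code)
print(f"entropy={entropy:.3f} bits, average code-length={avg_len:.3f} bits")
```

More probable symbols end up closer to the root and therefore receive shorter bitstrings, which is exactly the design principle stated above.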
2) Arithmetic coding: A concept that approaches the joint entropy H(W) of eq. (3) in a practical and efficient manner is arithmetic coding. It consists of expressing a particular sequence of samples w 0 , w 1 , ..., w n−1 of an n-long random process as a so called coding interval. An overview of the idea is given in the following.
Let $[L_j, L_j + R_j)$ be the coding interval before encoding symbol $w_j$ and let $L_0 = 0$ and $R_0 = 1$. Encoding of a symbol $w_j$ corresponds to deriving a coding interval $[L_{j+1}, L_{j+1} + R_{j+1})$ from the previous interval $[L_j, L_j + R_j)$ as follows. Subdivide $[L_j, L_j + R_j)$ into one subinterval for each element of $\mathcal{W}_s$ so that the width of the subinterval associated with the value $w_j$ is given as
$$R_{j+1} = R_j \cdot P(w_j \,|\, w_{j-1}, w_{j-2}, \dots, w_0)$$
for a given sequence of (already sampled) values $w_{j-1}, w_{j-2}, \dots, w_0$, and arrange the subintervals so that they are non-overlapping and adjacent. The subinterval associated with the sample $w_j$ to be encoded becomes the new coding interval $[L_{j+1}, L_{j+1} + R_{j+1})$. Encoding of n symbols yields the coding interval $[L_n, L_n + R_n)$, and the sequence of symbols $w_0, w_1, \dots, w_{n-1}$ can be reconstructed (in the decoder) when an arbitrary value inside of this coding interval is known. Figure 2 exemplifies this procedure for a binary random process. Interestingly, the width of the coding interval $R_n = P(w_0, w_1, \dots, w_{n-1})$ equals the probability of the sequence $w_0, w_1, \dots, w_{n-1}$. As the minimum achievable code-length for encoding the n symbols is known to be $-\log_2(R_n)$, the location of the interval $[L_n, L_n + R_n)$ needs to be signaled to the decoder in a way so that the number of written bits gets as close to $-\log_2(R_n)$ as possible. The basic encoding principle is as follows. Derive an integer $k$ so that
$$2^{-k} \le \frac{R_n}{2} < 2^{-k+1} \tag{4}$$
holds. Subdivide the unit interval $[0, 1)$ into $2^k$ (adjacent and non-overlapping) subintervals $[q 2^{-k}, (q + 1) 2^{-k})$ of width $2^{-k}$. Equation (4) guarantees that one of the intervals $[q 2^{-k}, (q + 1) 2^{-k})$ is fully contained in the coding interval (regardless of the exact location $L_n$ of the interval $[L_n, L_n + R_n)$), and if the decoder knows this interval, it can unambiguously identify $[L_n, L_n + R_n)$. Consequently, the index $q$ identifying this interval is written to the bitstream using $k$ bits.
Equation (4) can be rewritten as
$$-\log_2(R_n) + 1 \le k < -\log_2(R_n) + 2,$$
which shows that the ideal arithmetic coder only requires up to two bits more than the minimum possible code-length for a sequence of length n.
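The interval refinement and the choice of $k$ can be traced with exact rational arithmetic. The sketch below is an idealized arithmetic coder, not a practical fixed-precision implementation; for brevity it assumes a static, memoryless model (the same `pmf` for every symbol) instead of conditional probabilities, and all function names are hypothetical.

```python
from fractions import Fraction
import math

def subinterval(low, width, symbol, pmf):
    """Return the subinterval of [low, low + width) assigned to `symbol`."""
    for s in sorted(pmf):          # fixed symbol order shared by encoder and decoder
        if s == symbol:
            return low, width * pmf[s]
        low += width * pmf[s]
    raise ValueError("unknown symbol")

def encode(symbols, pmf):
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        low, width = subinterval(low, width, s, pmf)
    k = 0
    while Fraction(1, 2 ** k) > width / 2:   # smallest k with 2^-k <= R_n / 2
        k += 1
    q = math.ceil(low * 2 ** k)              # [q 2^-k, (q+1) 2^-k) lies inside the interval
    return k, q, width

def decode(k, q, n, pmf):
    value = Fraction(q, 2 ** k)              # any value inside the coding interval suffices
    low, width = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(n):
        for s in sorted(pmf):
            lo, wd = subinterval(low, width, s, pmf)
            if lo <= value < lo + wd:
                decoded.append(s)
                low, width = lo, wd
                break
    return decoded

pmf = {0: Fraction(3, 4), 1: Fraction(1, 4)}
symbols = [0, 0, 1, 0, 0, 0, 1, 0]
k, q, width = encode(symbols, pmf)
print(f"k={k} bits, ideal={-math.log2(float(width)):.2f} bits, q={q}")
print(decode(k, q, len(symbols), pmf) == symbols)
```

The number of bits `k` written for the whole sequence stays between one and two bits above the ideal code-length $-\log_2(R_n)$, in line with the bound stated above.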

B. Universal coding
In the previous subsection we learned that there exist codes that are able to produce binary representations of (almost) minimal redundancy (e.g. arithmetic codes). However, recall that the decoder has to know the joint probability distribution of the input source in order to build an optimal binary representation. In most practical situations the decoder has no prior knowledge about it. Hence, in such cases, we have to rely on so-called universal codes. They basically apply the following principle: 1) start with a general, data-independent probability model $P_{Dec}$, 2) update the model upon seeing incoming samples, 3) encode the input samples with regard to the updated probability model.
Fig. 3: CABAC is a universal lossless codec that encodes an n-long sequence of 1-dimensional values by: 1) representing each unique value by a binary string that corresponds to traversing a particular path on a predefined decision tree, 2) assigning to each decision (or bin) a probability model (or context model) and updating these upon encoding/decoding data, and 3) applying a binary arithmetic coder in order to encode/decode each bin.
Thus, the code-length that universal codes can achieve is bounded by the quality of the decoder's probability estimate. Concretely, let $P_{Dec}$ be the decoder's estimate of the input's probability model, then the average code-length that can be achieved satisfies
$$\bar{L}_C(\mathcal{W}) \ge H_{P, P_{Dec}}(\mathcal{W}) = H_P(\mathcal{W}) + D_{KL}(P \,\|\, P_{Dec}), \tag{5}$$
with $H_{P, P_{Dec}}(\mathcal{W}) = -\sum_{w \in \mathcal{W}} P(w) \log_2 P_{Dec}(w)$ being the cross-entropy and $D_{KL}$ the Kullback-Leibler divergence. Hence, a lossless code can only create binary representations with minimal redundancy iff its decoder's probability model is the same as that of the input source. In other words, the better its estimate is, the better it can encode the input samples. An example of a universal lossless code is the so-called two-part Huffman code. Given a set of samples to be encoded, it first makes an estimate of their empirical probability mass distribution (EPMD) and, subsequently, it encodes the samples with regard to it. However, it has the natural caveat that the estimate needs to be encoded as well, which may add a significant number of bits in many practical situations. Moreover, as we already discussed in the previous subsection, Huffman codes also come with a series of undesired properties that make them very inefficient for cases where fast adaptability and coding efficiency are required [19].
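The penalty for a mismatched decoder model can be checked numerically. The sketch below verifies, for a toy source and a hypothetical uniform decoder model, that the cross-entropy equals the entropy plus the Kullback-Leibler divergence; both distributions are illustrative assumptions.

```python
import math

def entropy_bits(p):
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def cross_entropy_bits(p, p_dec):
    """Average code-length (in bits) when coding source p with model p_dec."""
    return -sum(p[w] * math.log2(p_dec[w]) for w in p if p[w] > 0)

def kl_bits(p, p_dec):
    """Kullback-Leibler divergence D_KL(p || p_dec) in bits."""
    return sum(p[w] * math.log2(p[w] / p_dec[w]) for w in p if p[w] > 0)

p = {"a": 0.7, "b": 0.2, "c": 0.1}            # true source distribution
p_dec = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # data-independent decoder model

print("entropy       :", entropy_bits(p))
print("cross-entropy :", cross_entropy_bits(p, p_dec))
print("redundancy    :", kl_bits(p, p_dec))
```

The redundancy vanishes only when the decoder model coincides with the source distribution, which is why adaptive models that track the source are attractive.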
In general, a universal lossless code should satisfy the following desiderata:
• Universality: The code should have a mechanism that allows it to adapt its probability model to a wide range of different types of input distributions, in a sample-efficient manner.
• Minimal redundancy: The code should produce binary representations of minimal redundancy with regard to its probability estimate.
• High efficiency: The code should have high coding efficiency, meaning that encoding/decoding should have high throughput.
1) CABAC: Context-based Adaptive Binary Arithmetic Coding is a form of universal lossless coding that fulfils all of the above properties, in that it offers a high degree of adaptation, optimal code-lengths, and a highly efficient implementation. It was originally designed for the video compression standard H.264/AVC [17], but it is also an integral part of its successor H.265/HEVC. It is well known to attain higher compression performance as well as higher throughput compared to other entropy coding methods [21]. In short, it encodes each input sample by applying the following three stages:
1) Binarization: Firstly, it predefines a series of binary decisions (also called bins) under which each unique input sample element (or symbol) will be uniquely identified. In other words, it builds a predefined binary decision tree where each leaf identifies a unique input value.
2) Context-modeling: Then, it assigns a binary probability model to each bin (also named context model) which is updated on-the-fly by the local statistics of the data. This enables CABAC to model a wide range of source distributions.
3) Arithmetic coding: Finally, it employs an arithmetic coder in order to optimally and efficiently code each bin, based on the respective context model.
Notice that, in contrast to two-part Huffman codes, CABAC's encoder does not need to encode its probability estimates, since the decoder is able to analogously update its context models upon sequentially decoding the input samples. Codes that have this property are called backward-adaptive codes. Moreover, it is able to take local correlations into account, since the context models are updated by the local statistics of the incoming input data.
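The backward-adaptive principle can be illustrated with a single binary context model. The sketch below uses a simple count-based probability estimate as a stand-in; it does not reproduce CABAC's actual probability estimator, and the class and function names are hypothetical. It accumulates the ideal arithmetic-coding cost $-\log_2 P(\text{bin})$ of each bin under the current estimate and then updates the estimate, which a decoder could mirror after decoding the same bin.

```python
import math

class AdaptiveBinaryModel:
    """Count-based estimate of P(bin); a simplified stand-in for a context model."""

    def __init__(self):
        self.counts = [1, 1]  # data-independent starting point: P(0) = P(1) = 1/2

    def prob(self, bit):
        return self.counts[bit] / (self.counts[0] + self.counts[1])

    def update(self, bit):
        self.counts[bit] += 1

def ideal_code_length(bits):
    """Sum of -log2 P(bin) under the adaptive estimate, in bits."""
    model = AdaptiveBinaryModel()
    total = 0.0
    for b in bits:
        total += -math.log2(model.prob(b))  # cost under the current estimate
        model.update(b)                     # the decoder can repeat this update
    return total

skewed = [0] * 90 + [1] * 10     # strongly biased bins
balanced = [0, 1] * 50           # no exploitable bias
print(f"skewed  : {ideal_code_length(skewed):.1f} bits for {len(skewed)} bins")
print(f"balanced: {ideal_code_length(balanced):.1f} bits for {len(balanced)} bins")
```

No probability table has to be transmitted: the model starts from a fixed state and learns the bias from the data, so skewed bins quickly cost much less than one bit each.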

C. Lossy coding
In contrast to lossless coding, information is lost in the lossy coding process. This implies that the quantizer $Q$ is non-invertible, and therefore $\exists w : D(w, q) \neq 0$. An example of a distortion measure is the squared error $D(w, q) = \|w - q\|_2^2$, but we stress that other measures can be considered as well (which will become apparent in section III).
The infimum of the rate-distortion objective (1) $\forall \lambda$ is referred to as the rate-distortion function in the source coding literature [19], and it represents the fundamental bound on the performance of lossy source coding algorithms. However, as we have already discussed above, finding the optimal code that follows the rate-distortion function is most often NP-hard, and it can be calculated only for very few types and/or special cases of input sources. Therefore, in practice, we relax the problem until we formalize an objective that we can solve in a feasible manner.
Firstly, we fix the binarization map $B$ by selecting a particular (universal) lossless code and condition the minimization of (1) on it. That is, now we only ask for the quantizer $Q$, along with its reconstruction values $Q^{-1}$, that minimize the respective rate-distortion objective. Secondly, we always assume that we encode an n-long 1-dimensional random process. Then, objective (1) simplifies to: given a lossless code $B$, find
$$(Q, Q^{-1})^* = \arg\min_{Q,\, Q^{-1}} \; D(w_j, q_i) + \lambda L_C(b_j) \tag{6}$$
$\forall j \in \{0, \dots, n - 1\}$, where $b_j = (B \circ Q)(w_j)$, $q_i = (Q^{-1} \circ Q)(w_j) \in \mathcal{Q}_s := \{q_0, q_1, \dots, q_{K-1}\} \subset \mathbb{R}$ and $K = |\mathcal{Q}_s| < |\mathcal{W}_s| = n$.
For instance, if we choose B such that it assigns a binary representation of fixed-length to all w j , then the minimizer of (6) can be found by applying the K-Means algorithm.
The minimizers of (6) are called scalar quantizers, since they measure the distortion independently for each input sample. In contrast, vector quantizers are those that result from minimizing (6) when grouping a sequence of input samples together and measuring the distortion in the respective vector space. It is well known that scalar quantizers are fundamentally more redundant than vector quantizers. Nevertheless, due to the associated complexity of vector quantizers, it is more common to apply scalar quantizers in practice. Moreover, the inherent redundancy of scalar quantizers is negligible for most practical applications [19].
We also want to stress that although the distortion in (6) is measured independently for each sample, the binarization b j (and consequently the respective code-length) of each sample can still depend on the other samples by taking correlations into account.
1) Scalar Lloyd algorithm: An example of an algorithm that finds a local optimum of (6) is the Lloyd algorithm. It approximates the average code-length of the quantized samples $q_j = (Q^{-1} \circ Q)(w_j)$ with the entropy of their empirical probability mass distribution (EPMD). Thus, it substitutes the code-length in (6) by $L_C(b_j) = -\log_2 P_{EPMD}(q_j)$ and applies a greedy algorithm in order to find a quantizer $Q$ and quantization points $Q^{-1}$ that minimize the respective objective. Pseudocode can be found in the appendix (see algorithm 4).
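The alternating structure of such an entropy-constrained Lloyd iteration can be sketched as follows. This is a minimal illustration under stated assumptions (squared-error distortion, a hypothetical initialization of the quantization points, a fixed number of iterations) and not the pseudocode of the appendix.

```python
import math

def entropy_constrained_lloyd(samples, centers, lam, iters=50):
    """Alternate between RD-optimal assignments and center / EPMD updates."""
    centers = list(centers)
    probs = [1.0 / len(centers)] * len(centers)
    for _ in range(iters):
        # Assignment step: minimize squared error + lam * (-log2 P_EPMD).
        clusters = [[] for _ in centers]
        for w in samples:
            costs = [(w - c) ** 2 - lam * math.log2(p)
                     for c, p in zip(centers, probs)]
            clusters[costs.index(min(costs))].append(w)
        # Update step: drop unused points, recompute centers and the EPMD.
        clusters = [c for c in clusters if c]
        centers = [sum(c) / len(c) for c in clusters]
        probs = [len(c) / len(samples) for c in clusters]
    return centers, probs

samples = [0.0] * 8 + [0.1] * 8 + [1.0] * 2 + [1.1] * 2
init = [0.0, 0.1, 1.0, 1.1]
for lam in (0.0, 1.0):
    centers, probs = entropy_constrained_lloyd(samples, init, lam)
    print(f"lam={lam}: centers={[round(c, 3) for c in centers]}, probs={probs}")
```

With `lam = 0` the iteration reduces to plain K-Means; larger values of `lam` make rarely used quantization points too expensive, so that they are merged into more probable ones and the entropy of the quantized samples drops.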
2) CABAC-based RD-quantization: If we are given a set of quantization points $\mathcal{Q}_s$ and select CABAC as our subsequent universal lossless code, then we can trivially minimize (6) by sequentially quantizing the input samples. In the video coding standards, the set of quantization points is predefined by the particular choice of quantization strength $\lambda$ [16]. However, in the context of neural network compression we do not know of a good relationship between the quantization strength and the set of quantization points. In section III we describe how we tackled this problem.
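The sequential minimization for a given set of quantization points can be sketched as follows. The sketch assumes a uniform grid with a hypothetical step size `step` and index range `max_index`, and it replaces CABAC's rate estimate by a simple adaptive count-based estimate; it therefore only illustrates the principle and is not the quantizer used by DeepCABAC.

```python
import math
from collections import Counter

def rd_quantize(weights, step, lam, max_index=8):
    """For each weight, pick the grid index minimizing distortion + lam * rate."""
    # Adaptive index statistics, initialized uniformly (stand-in for context models).
    counts = Counter({i: 1 for i in range(-max_index, max_index + 1)})
    total = sum(counts.values())
    indices = []
    for w in weights:
        def cost(i):
            rate = -math.log2(counts[i] / total)   # estimated bits for index i
            return (w - i * step) ** 2 + lam * rate
        best = min(counts, key=cost)
        indices.append(best)
        counts[best] += 1    # a decoder could mirror this update after decoding
        total += 1
    return indices

weights = [0.0] * 20 + [0.13]
print(rd_quantize(weights, step=0.25, lam=0.0)[-1])    # nearest grid point
print(rd_quantize(weights, step=0.25, lam=0.01)[-1])   # rate-aware decision
```

With `lam = 0` every weight is mapped to its nearest grid point, whereas a positive `lam` pulls borderline weights towards indices that are cheap to encode under the current statistics.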

D. Model compression vs. source coding
So far, we have reviewed some fundamental results of source coding theory. However, in this work, we are rather interested in the general topic of model compression. There is a fundamental difference between both paradigms. Namely, now we are more interested in the predictive performance of the resulting quantized model rather than the distance between the quantized and original parameters. Figure 4 highlights this distinction. We will now formalize the general model compression paradigm for the supervised learning setting. However, the problem can be analogously formulated for other learning tasks.
Fig. 4: In the source coding problem the objective is as in eq. (6), whereas in the model compression problem it is as in eq. (7). The main difference between the two paradigms lies in their measure of error, where the former is based on a distance measure between the unquantized and quantized parameters and the latter on the prediction performance of the quantized model. This results in different quantization schemes as solutions, which are displayed in the sketch. Different colors denote different parameter values. The different shapes correspond to different stages of the quantization procedure, with circles denoting the unquantized values, squares their respective integer representations and triangles their corresponding quantization points.
Firstly, we assume that we are given only one model sample with n real-valued weight parameters (thus, here the input space is equivalent to the one discussed above). In addition, we assume a universal coding setting, where the decoder has no prior knowledge regarding the distribution of the parameter values. We argue that this simulates most real-world scenarios.
Now let (x, y) ∈ D be a set of data samples. Let further y' ∼ P(y'|x, w) denote the approximate posterior of the data, parameterized by w ∈ R^n. For instance, P(y'|x, w) may be a trained neural network model with parameters w. Finally, let B be a chosen and fixed universal lossless code. Then, we aim to find a quantizer Q* that minimizes
$$Q^* = \arg\min_{Q} \; \mathcal{L}(y, y') + \lambda L_B(b) \qquad (7)$$
with y' ∼ P(y'|x, q) being outputs of the quantized model q = (Q^{-1} ∘ Q)(w) and b = (B ∘ Q)(w). Here L(y, y') denotes the loss of the learning task, L_B(b) the code-length of b, and λ the Lagrange multiplier.
The first term in (7) expresses the minimization of the usual learning task of interest, whereas the second term explicitly expresses the code-length of the model. This minimization objective is well motivated by the Minimum Description Length (MDL) principle [11]. However, finding the minimum of (7) is also most often NP-hard. This motivates further approximations where, as a result, one can directly apply techniques from the source coding literature in order to minimize the desired objective.
1) Relaxation of the model compression problem into a source coding problem: We may further assume that the given unquantized model has been pre-trained on the desired task and that it reaches satisfactory accuracies. Then, in such cases, it is reasonable to replace the first term in (7) by the KL-Divergence between the unquantized model P(y'|x, w) and the respective quantized model P(y'|x, q). That is, now we want to quantize our model such that its output distribution does not differ too much from its original version.
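Furthermore, if we now assume that the output distributions do not differ too much from each other, then we can approximate the KL-Divergence with the Fisher Information Matrix (FIM). Concretely, with δw = q − w and
$$F := \mathbb{E}_{P(y'|x,w)}\left[\nabla_w \log P(y'|x,w)\, \nabla_w \log P(y'|x,w)^\top\right]$$
we have
$$D_{KL}\big(P(y'|x,w)\,\|\,P(y'|x,q)\big) \approx \tfrac{1}{2}\, \delta w^\top F\, \delta w. \qquad (8)$$
Then, by substituting (8) in (7) and absorbing the constant factor into λ, we get the following minimization objective
$$(Q, Q^{-1})^* = \arg\min_{(Q, Q^{-1})} \; D(w, q) + \lambda L_B(b), \qquad D(w, q) = \delta w^\top F\, \delta w, \qquad (9)$$
where L_B(b) is the code-length of the binary representation b. Objective (9) now follows the same paradigm as the usual source coding problem, with the peculiarity that D(w, q) now (approximately) measures the distortion between w and q in the space of output distributions instead of the Euclidean space. The advantage of the rate-distortion objective (9) is that, after the FIM has been calculated, it can be solved by applying common techniques from the source coding literature, such as the scalar Lloyd algorithm.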
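The step from the KL-Divergence to a quadratic form in δw can be made explicit with a second-order Taylor expansion. The following is a standard textbook sketch rather than a derivation taken from this work; it abbreviates P(y'|x, w) as p_w, assumes the usual regularity conditions (differentiation and expectation may be exchanged), and drops terms of third order in δw.

```latex
\begin{align*}
D_{KL}\big(p_w \,\|\, p_{w+\delta w}\big)
  &= -\,\mathbb{E}_{p_w}\!\left[\log p_{w+\delta w} - \log p_w\right] \\
  &\approx -\,\delta w^\top
       \underbrace{\mathbb{E}_{p_w}\!\left[\nabla_w \log p_w\right]}_{=\,0}
     \;-\; \tfrac{1}{2}\,\delta w^\top\,
       \mathbb{E}_{p_w}\!\left[\nabla_w^2 \log p_w\right]\delta w \\
  &= \tfrac{1}{2}\,\delta w^\top F\,\delta w,
  \qquad
  F = \mathbb{E}_{p_w}\!\left[\nabla_w \log p_w\,\nabla_w \log p_w^\top\right]
    = -\,\mathbb{E}_{p_w}\!\left[\nabla_w^2 \log p_w\right].
\end{align*}
```

The constant factor 1/2 does not change the minimizer of a rate-distortion objective, since it can be absorbed into the multiplier λ of the rate term.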
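However, minimizing (9) as well as estimating the FIM for deep neural networks usually requires considerable computational resources, and is most often infeasible for practical applications. Therefore, we further approximate the FIM by only its diagonal elements (FIM-diagonals), which can be efficiently estimated (see appendix). As a result, (9) simplifies into
$$(Q, Q^{-1})^* = \arg\min_{(Q, Q^{-1})} \; F_i\,(w_i - q_i)^2 + \lambda L_B(b_i) \qquad (10)$$
∀i ∈ {0, ..., n − 1}, which can be feasibly solved. Here F_i denotes the i-th diagonal element of the FIM, q_i the quantization point assigned to w_i, and L_B(b_i) the code-length of its binary representation.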
In the next section we will give a thorough description of our proposed coder. Its design complies with all desired properties that a coder for neural network compression should have.

III. DEEPCABAC
In light of the discussion in the previous section, we can highlight a set of desiderata that a coder for neural network compression should have.
• Minimal redundancy: State-of-the-art deep neural networks usually contain millions of parameters. Thus, any type of redundancy in the weight parameters may imply several additional MB being stored. Hence, the code should output a binary representation with minimal redundancy per weight element.
• Universality: The code should be applicable to any type of incoming neural networks, without having to know their distribution a priori. Hence, the code should entail a mechanism that allows it to adapt to a rich number of possible parameter distributions.
• High coding efficiency: The computational complexity of encoding/decoding should be minimal. In particular, the throughput of the decoder should be very high if performing inference on the compressed representation is desired.
• Configurable error vs. compression strength: The coder should have a hyperparameter that controls the trade-off between the compression strength and the incurred prediction error.
• High data efficiency: Minimizing (7) implies access to data. Hence, it is desirable that the coder finds a (local) solution with the least amount of data samples possible.

Fig. 5: Firstly, DeepCABAC scans the weight parameters of each layer of the network in row-major order. Then, it selects a particular hyperparameter β that will define the set of quantization points. Subsequently, it applies a quantizer to the weight values that minimizes the respective weighted rate-distortion function (11). Then, it compresses the quantized parameters by applying our adapted version of CABAC. Finally, it reconstructs the network and measures the respective accuracy of it. The process is repeated for different hyperparameters β until the desired trade-off between accuracy and bit-size of the network is achieved.

A. DeepCABAC's coding procedure
We propose a coding algorithm that satisfies all the above properties. We named it DeepCABAC, since it is based on applying CABAC to the network's quantized weight parameters. Figure 5 shows the respective compression scheme. It performs the following steps:
1) It extracts the weight parameters of the neural network layer-by-layer in row-major order².
2) Then, it selects a particular value β which defines the set of quantization points.
3) Subsequently, it quantizes the weight values by minimizing a weighted rate-distortion function, which implicitly takes the impact of quantization on the accuracy of the network into account.
4) Then, it compresses them by applying our adapted version of CABAC.
5) Finally, it reconstructs the network and evaluates the prediction performance of the quantized model.
6) The process is repeated for a set of hyperparameters β, until the desired accuracy-vs-size trade-off is achieved.
This approach has several advantageous properties. Firstly, it applies CABAC to the quantized parameters, which ensures that the code satisfies desiderata 1-3. Secondly, by conducting the compression for a set of quantizer hyperparameters we can select the desired Pareto-optimal solutions in the accuracy vs. bit-size plane, thus satisfying property 4. Finally, since only one evaluation of the model is required per hyperparameter configuration, significantly fewer data samples are required for the compression process than are usually employed for training.
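The outer loop of the procedure above can be sketched as follows. This is a toy illustration under loud assumptions, not the authors' implementation: the function names are hypothetical, the quantizer is plain nearest-neighbor (only the step-size is searched, not λ), the lossless coder is replaced by the empirical entropy of the indices, and the model evaluation is replaced by a proxy score based on the mean squared error.

```python
import math
from collections import Counter

def quantize_nearest(weights, step):
    """Map each weight to the integer index of its nearest grid point."""
    return [round(w / step) for w in weights]

def ideal_code_length_bits(indices):
    """Stand-in for the lossless coder: total empirical entropy in bits."""
    counts = Counter(indices)
    n = len(indices)
    return -sum(c * math.log2(c / n) for c in counts.values())

def proxy_accuracy(weights, reconstructed):
    """Stand-in for evaluating the quantized model: 1 - mean squared error."""
    mse = sum((w - q) ** 2 for w, q in zip(weights, reconstructed)) / len(weights)
    return 1.0 - mse

def search(weights, step_sizes, min_accuracy):
    """Try every step-size; keep the smallest encoding that is accurate enough."""
    best = None
    for step in step_sizes:
        indices = quantize_nearest(weights, step)               # quantize
        bits = ideal_code_length_bits(indices)                  # "compress"
        reconstructed = [i * step for i in indices]             # reconstruct
        accuracy = proxy_accuracy(weights, reconstructed)       # evaluate
        if accuracy >= min_accuracy and (best is None or bits < best["bits"]):
            best = {"step": step, "bits": bits, "accuracy": accuracy}
    return best

# Hypothetical toy "layer" and candidate step-sizes.
weights = [0.02, -0.01, 0.0, 0.4, -0.35, 0.03, 0.01, -0.02]
result = search(weights, step_sizes=[0.01, 0.05, 0.1, 0.2, 0.5], min_accuracy=0.999)
```

In this toy setting the coarsest step-size violates the accuracy constraint and the finest ones waste bits, so the search settles on an intermediate step-size, which mirrors the accuracy-vs-size trade-off described above.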
In the following we will explain in more detail the different components of DeepCABAC.

B. Lossless coder of DeepCABAC
Consider the weight distribution of the last fully-connected layer of the trained VGG16 model displayed in figure 6. As we can see, there is one peak near 0 and the distribution is asymmetric and monotonically decreasing on both sides. In our experience, all layers we have studied so far have weight distributions with similar properties. Hence, in order to accommodate this type of distribution, we adopted the following binarization procedure.
Given a quantized weight tensor in its matrix form³, DeepCABAC scans the weight elements in row-major order and binarizes them as follows:
1) The first bit, sigFlag, determines if the weight element is a significant element or not. That is, it indicates if the weight value is 0 or not. This bit is then encoded using a binary arithmetic coder, according to its respective context model (color-coded in grey). The context model is initially set to 0.5 (thus, 50% probability that a weight element is 0 or not), but will automatically be updated to the local statistics of the weight parameters as DeepCABAC encodes more elements.
2) Then, if the element is not 0, the sign bit or signFlag is analogously encoded, according to its respective context model.
3) Subsequently, a series of bits are analogously encoded, which determine if the element is greater than 1, 2, ..., n ∈ N (hence AbsGr(n)Flags). The number n becomes a hyperparameter for the encoder.
4) Finally, the remainder is encoded using an Exponential-Golomb⁴ code [23], where each bit of the unary part is also encoded relative to its context model. Only the fixed-length part of the code is not encoded using a context model (color-coded in blue).
For instance, assume that n = 1, then the integer −4 would be represented as 111101, or the 7 as 10111010. Figure 7 depicts an example scheme of the binarization procedure.

The first three parts of the binarization scheme empower CABAC to adapt its probability estimates to any shape of distribution around the value 0 and, therefore, to encode the most frequent values with minimal redundancy. For the remainder values, we opted for the Exponential-Golomb code since it automatically assigns smaller code-lengths to smaller integer values. However, in order to further enhance its adaptability, we also encode its unary part with the help of context models. We left the fixed-length part of the Golomb code without context models, meaning that we approximate the distribution of those values by a uniform distribution (see figure 6). We argue that this is reasonable since usually the distribution of large numbers becomes more and more flat, and it comes with the direct benefit of increasing the efficiency of the coder.

⁴ To recall, the Exponential-Golomb code encodes a positive integer 2^k < i ≤ 2^(k+1) by firstly encoding the exponent k using a unary code and subsequently the remainder r = i − 2^k in fixed-point representation.

Fig. 6: Distribution of the weight matrix of the last layer of VGG-16 (trained on ImageNet) after uniform quantization over the range of values. In red is CABAC's (possible) estimation of the distribution. The first n + 2 bits allow to adapt to any type of shape around 0 since they are encoded with regards to a context model. The remainder can only approximate the shape by a step-like distribution, since they are encoded with an Exponential-Golomb code where the fixed-length parts are encoded without a context model.

Fig. 7 (diagram fields: SigFlag, SignFlag, AbsGr(n)Flags): 3) Subsequently, a series of bits are encoded, which indicate if the weight value is greater than 1, 2, ..., n ∈ N (the so-called AbsGr(n)Flag). 4) Finally, the remainder is encoded. The gray bits (also named regular bins) represent bits that are encoded using an arithmetic coder according to a context model. The other bits, the so-called bypass bins, are encoded in fixed-point form. For instance, in the above diagram n = 1, and therefore 1 → 100, −4 → 111101 or 7 → 10111010.
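The binarization steps can be sketched in a few lines of Python. This sketch only produces the bin strings; it does not perform the arithmetic coding or maintain context models. The function names are hypothetical, and two details are inferred from the three worked examples in the text rather than stated explicitly there: the sign bin is 1 for negative values, and the value passed to the Exponential-Golomb stage is |w| − n − 1, binarized with a unary prefix followed by a fixed-length suffix.

```python
def exp_golomb_bins(value):
    """Order-0 Exp-Golomb binarization: unary prefix, then fixed-length suffix."""
    bins = []
    k = 0
    while value >= (1 << k):
        bins.append("1")            # unary part (context-coded in the text)
        value -= 1 << k
        k += 1
    bins.append("0")                # terminates the unary part
    for shift in range(k - 1, -1, -1):
        bins.append(str((value >> shift) & 1))  # fixed-length part (bypass)
    return "".join(bins)

def binarize(weight, n=1):
    """Return the bin string of one quantized integer weight."""
    if weight == 0:
        return "0"                               # sigFlag = 0, nothing else follows
    bins = ["1"]                                 # sigFlag = 1
    bins.append("1" if weight < 0 else "0")      # signFlag (assumed: 1 = negative)
    magnitude = abs(weight)
    for threshold in range(1, n + 1):            # AbsGr(1) ... AbsGr(n) flags
        if magnitude > threshold:
            bins.append("1")
        else:
            bins.append("0")
            return "".join(bins)                 # magnitude fully determined
    bins.append(exp_golomb_bins(magnitude - n - 1))  # remainder
    return "".join(bins)

examples = {w: binarize(w, n=1) for w in (1, -4, 7)}
```

With n = 1 this reproduces the three bin strings quoted in the text for the values 1, −4 and 7.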

C. Lossy coder of DeepCABAC
After establishing CABAC as our choice of universal lossless code, we now aim to find the optimal quantizer that minimizes the objective stated in (7) (section II-D). To recall, this involves the optimization of two components (see figure 4):
• Assignment: finding the quantizer Q that assigns the optimal set of integers to each weight parameter.
• Quant. points: finding the optimal quantization points q_j = (Q^{-1} ∘ Q)(w_j).
Since neural networks usually rely on scalable, gradient-based minimization techniques in order to optimize their loss function, finding the quantizers that solve (7) becomes infeasible in most cases since Q is a non-differentiable map. Therefore, we opted for a simpler approach.
Firstly, we decouple the assignment map Q and the quantization points Q^{-1} from each other and optimize them independently. The quantization points then become hyperparameters for the quantizer, and their values are selected such that they minimize the loss function directly. This separation between Q and Q^{-1} was empirically motivated, since we discovered that the network's performance is significantly more sensitive to the choice of Q^{-1} than to the assignment Q. We discuss this in more detail in the experimental section.
1) The quantization points: Since finding the correct map Q^{-1} for a large number of points can be very complex, we constrain them to be equidistant to each other with a specific step-size ∆. That is, each point q_k can be written as q_k = ∆ I_k with I_k ∈ Z. This not only simplifies the problem considerably, but also encourages fixed-point representations, which can be exploited in order to perform inference with lower complexity [24], [25].
2) The assignment: Hence, the quantizer has two configurable hyperparameters β = (∆, λ), the former defining the set of quantization points and the latter the quantization strength. Once a particular tuple is given, the quantizer Q_β will then assign each weight parameter w_i to its corresponding quantization point by minimizing the weighted rate-distortion function
$$Q_\beta(w_i) = k^* = \arg\min_{k} \; F_i\,(w_i - q_k)^2 + \lambda L_{ik} \qquad (11)$$
∀i ∈ {0, ..., n − 1}, where F_i is the i-th FIM-diagonal and L_ik is the code-length of the quantization point q_k at the weight w_i as estimated by CABAC.
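The assignment rule can be illustrated with a small sketch. It assumes equidistant points q_k = ∆·k and a per-weight cost of the form F_i (w_i − q_k)² + λ L_ik. The code-length function below is a hypothetical static stand-in (a simple bin count that grows with |k|), not CABAC's adaptive estimate, and the restriction to a few candidates around the nearest grid point is likewise an assumption made to keep the example short.

```python
def code_length_bits(index):
    """Hypothetical static code-length: 1 bit for zero, more bits for larger |index|."""
    if index == 0:
        return 1
    exponent = abs(index).bit_length() - 1      # floor(log2(|index|))
    return 2 + 2 * exponent + 1                 # sig + sign + Exp-Golomb-like part

def assign(weights, fisher, step, lam, search_radius=2):
    """Pick, for each weight, the integer index minimizing distortion + lam * rate."""
    indices = []
    for w, f in zip(weights, fisher):
        nearest = round(w / step)
        candidates = set(range(nearest - search_radius, nearest + search_radius + 1))
        candidates.add(0)                       # always allow quantizing to zero

        def cost(k):
            distortion = f * (w - k * step) ** 2
            rate = code_length_bits(k)
            # Ties are broken towards the smaller magnitude for determinism.
            return (distortion + lam * rate, abs(k), k)

        indices.append(min(candidates, key=cost))
    return indices

weights = [0.26, -0.04, 0.11]
fisher = [1.0, 1.0, 1.0]                        # F_i = 1 for all i
nearest_neighbor = assign(weights, fisher, step=0.1, lam=0.0)
heavily_penalized = assign(weights, fisher, step=0.1, lam=10.0)
```

For λ = 0 the rule reduces to nearest-neighbor quantization, while a large λ pushes every weight to the cheapest index, here 0. Larger values of F_i make the corresponding weight harder to move away from its nearest grid point.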
As previously mentioned, we perform a grid search over the hyperparameters ∆ and λ in order to find the quantizer configuration that achieves the desired accuracy vs. bit-size trade-off. For that, however, we need to define the set of hyperparameter candidates to search over, in particular for the step-sizes ∆. In this work we considered two approaches for finding the set of step-sizes, which we denote as DeepCABAC-version1 (DC-v1) and DeepCABAC-version2 (DC-v2).
3) DeepCABAC-version1 (DC-v1): In DC-v1 we firstly estimate the diagonals of the FIM by applying scalable Bayesian techniques. Concretely, we parametrize the network with a fully-factorized Gaussian posterior and minimize the variational objective proposed in [26]. As a result, we obtain a mean µ_i and a standard deviation σ_i for each parameter, where the former can be interpreted as its (new) value (thus w_i → µ_i) and the latter as a measure of its "robustness" against perturbations. Therefore, we simply set F_i = 1/σ_i^2 in (11). This is also well motivated theoretically, since [27] showed that the variances of the parameters approximate the diagonals of the FIM for a similar variational objective. We also provide a more thorough discussion and a precise connection between them in the appendix.
After the FIM-diagonals have been estimated, we define the set of considered step-sizes as follows:
$$\Delta = \frac{2\,|w_{max}|}{\frac{2\,|w_{max}|}{\sigma_{min}} + S} \qquad (12)$$
where σ_min is the smallest standard deviation and w_max the parameter with the highest magnitude. S is then the quantizer's hyperparameter, which controls the "coarseness" of the quantization points. By selecting ∆ in such a manner we ensure that the quantization points lie within the range of the standard deviation of each weight parameter, in particular for values S ≥ 0. Hence, we selected S to be S ∈ {0, 1, ..., 256}.
One advantage of this approach is that we can have one global hyperparameter S for the entire network, but each layer will still attain a different value for its step-size if we select one σ_min per layer. Thus, with this approach we can adapt the step-size to the layer's sensitivity to perturbations. Moreover, the quantization will also take the sensitivity of each single parameter into account.
4) DeepCABAC-version2 (DC-v2): Estimating the diagonals of the FIM can still be computationally expensive since it requires the application of the backpropagation algorithm for several iterations in order to minimize the variational objective. Moreover, it only offers an approximation of the robustness of each parameter, and can therefore sometimes be misleading and limit the potential compression gains that can be achieved. Therefore, for reasons of simplicity and computational cost, we also considered searching directly for a good set of candidates ∆ ∈ {∆_0, ..., ∆_{m−1}}. We do so by applying a first round of the grid-search algorithm while applying a nearest-neighbor quantization scheme (that is, for λ = 0). This allows us to identify the range of step-sizes that do not considerably harm the network's accuracy when applying the simplest quantization procedure. Then, we quantize the parameters as in eq. (11), but without the diagonals of the FIM (thus, F_i = 1 ∀i).
Under a limited computational budget, this approach has the advantage that we can directly search for a better set of step-sizes ∆, since we save the cost of estimating the FIM-diagonals. However, since DC-v2 considers only one global step-size for the entire network, it cannot adapt to the different sensitivities of the layers.

IV. RELATED WORK
There has been a plethora of work focusing on the topic of neural network compression [12], [13]. In some way or another, all of them try to partially solve the general model compression objective (7) (section II-D). In the following we will thoroughly examine the currently proposed approaches and discuss some of their advantages and disadvantages.

A. Lossy neural network compression
Some of the approaches proposed so far include:
1) Trained scalar quantization: These are methods that aim to minimize the model compression objective (7) by applying training algorithms that learn the optimal quantization map and reconstruction values. Inter alia, this includes sparsification methods [26], [28]-[33] which try to minimize the L0-norm of the network's parameters. Others attempted to find optimal binary or ternary weighted networks [34]-[37], or a more general set of (locally) optimal quantizers [38]-[42].
Although these methods are able to attain very high compression ratios, they are also very computationally expensive in that several training iterations over a large training set have to be performed in order to find the optimal quantized network. In contrast, DeepCABAC does not require any retraining, nor access to the full training dataset in order to be applicable.
2) Non-trained scalar quantization methods: Another line of work has focused on implicitly minimizing (7). These methods also rely on distance measures for quantizing the network's parameters [43]-[46]. In fact, they can be seen as special cases of (10), in that they either use different approximations of the FIM-diagonals or apply other minimization algorithms. To the best of our knowledge, mainly two quantizers are widely applied by the community: the scalar uniform quantizer and the weighted scalar Lloyd algorithm. Usually, the former consists of uniformly spreading K ∈ N quantization points over the range of parameter values and then applying nearest-neighbor quantization to them [38], [40], [45]. The latter consists of applying the scalar Lloyd algorithm in order to find the quantizer that minimizes a weighted rate-distortion objective (10). In particular, [45] considers the diagonal elements of the empirical average of the Hessian of the loss function, which has a close connection to the FIM-diagonals (see appendix for a comprehensive discussion).
Quantization methods that do not rely on retraining are significantly less computationally expensive. However, their compression gains are heavily limited by the network's unquantized parameter distribution, since they rely on a distance measure for quantization. Moreover, as already mentioned, most of these methods only implicitly take the impact on the accuracy of the network into account. In contrast, DeepCABAC explicitly takes the accuracy of the network into account, since its hyperparameters are optimized with regard to it.

B. Lossless neural network compression
In the field of lossless network compression, we are given an already quantized model and we want to apply a universal lossless code to its parameters in order to maximally compress it. Hence, this setting is entirely equivalent to the usual lossless source coding setting discussed in sections II-A and II-B, and therefore all of its theorems and results can be applied in a straightforward manner. Nevertheless, most of the previous work did not apply state-of-the-art universal lossless compression algorithms to them. Instead, these are some of the most commonly used:
1) Fixed-length numerical representations: These methods reduce the bit-length representation of the parameter values after quantization [34]-[37], [47]-[52]. They usually have the advantage of immediately reducing the complexity of performing inference, however, at the expense of having a highly redundant network representation.
2) Scalar Huffman code: Others applied the scalar Huffman code to quantized neural networks [45], [46]. However, as we have already discussed in section II-A, this code has several disadvantages compared to other state-of-the-art lossless codes such as arithmetic codes. Probably the most prominent one is that this code is suboptimal in that it incurs up to 1 bit of redundancy per parameter being encoded. This can be quite significant for large networks with millions of parameters. For instance, VGG16 [53] contains 138 million parameters, meaning that the binary representation of any quantized version of it may have about 17 MB of redundancy if we encode it using the scalar Huffman code.
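The redundancy argument can be checked numerically. The sketch below builds a scalar Huffman code for a hypothetical, strongly skewed three-symbol distribution (chosen only for illustration, not taken from any network), compares its average code-length with the entropy, and repeats the back-of-the-envelope estimate for 138 million parameters at the worst case of 1 bit of redundancy each.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code-length in bits of every symbol."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        for s in symbols1 + symbols2:
            lengths[s] += 1                      # one level deeper in the tree
        heapq.heappush(heap, (p1 + p2, next_id, symbols1 + symbols2))
        next_id += 1
    return lengths

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.9, 0.05, 0.05]                        # hypothetical skewed distribution
lengths = huffman_code_lengths(probs)
average_length = sum(p * l for p, l in zip(probs, lengths))
redundancy = average_length - entropy_bits(probs)   # bits per symbol above entropy

# Worst case from the text: 1 bit of redundancy for each of 138 million parameters.
worst_case_megabytes = 138e6 * 1 / 8 / 1e6
```

For this distribution the Huffman code spends about 1.1 bits per symbol although the entropy is below 0.6 bits, because no symbol can be assigned less than one whole bit. The worst-case estimate evaluates to 17.25 MB, in line with the figure quoted above.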
3) Compressed matrix representations: Most of the literature that sparsifies the network's parameters aims to convert the resulting networks into a compressed sparse matrix representation, e.g., the Compressed Sparse Row (CSR) representation. These matrix data structures not only offer compression gains, but also an efficient execution of the associated dot product algorithm [54]. Similarly, [14] proposed two novel matrix representations, the Compressed Entropy Row (CER) and Compressed Shared Elements Row (CSER) representations, which are provably more efficient than CSR with regard to both compression and execution efficiency when the network's parameters have low entropy statistics.
However, these matrix representations are also redundant in that they do not approach the reachable entropy limit (3) (section II-A). [38] attempted to remove some of the redundancies entailed in the CSR representation by applying a scalar Huffman code to its numerical arrays. However, this again has the same limitations that come with applying the scalar Huffman code.
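For readers unfamiliar with the CSR layout mentioned above, the following minimal sketch shows how a dense matrix is converted into the three CSR arrays and how the matrix-vector product runs over the non-zero entries only. The function names are hypothetical and the code is a plain-Python illustration of the standard format, not the implementation used in [38] or [54].

```python
def to_csr(matrix):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)        # non-zero value
                col_idx.append(j)       # its column index
        row_ptr.append(len(values))     # where the next row starts
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x, touching only the stored non-zero elements."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

dense = [
    [0, 0, 3],
    [1, 0, 0],
    [0, 0, 0],
    [0, 2, 4],
]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
```

The three arrays are still stored with fixed-length integers and values, which is the kind of redundancy that an entropy coder applied on top of them, or a coder applied directly to the quantized weights, can remove.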

C. Compression pipelines/frameworks
Among all the approaches proposed for deep neural network compression, there is one paradigm that stands out in that it achieves very high compression gains [26], [38], [40], [42], [55]. Namely, it consists of applying four different compression stages:
1) Sparsification: Firstly, the networks are maximally sparsified by applying a trained sparsification technique.
2) Quantization: Then, the non-zero elements are quantized by applying one of the non-trained quantization techniques.
3) Fine-tuning: Subsequently, the quantization points are fine-tuned in order to recover the accuracy loss incurred by the quantization procedure.
4) Lossless compression: Finally, the quantized values are encoded using a lossless coding algorithm.
DeepCABAC is designed to enhance stages 2 and 4. As we will see in the next section, DeepCABAC is able to considerably boost the attainable compression gains, surpassing previously proposed methods for stages 2 and 4.

V. EXPERIMENTS
In this section we benchmark DeepCABAC and compare it to other compression algorithms. We also design further experiments to shed light on the effectiveness of its different components.

A. General compression benchmark
Here we benchmark the maximum compression gains attained by applying DeepCABAC. In order to assess its universality, we applied it to a wide set of pretrained network architectures, trained on different data sets. Concretely, we used the VGG16, ResNet50 and MobileNet-v1 architectures, trained on the ImageNet dataset, a smaller version of the VGG16 architecture trained on the CIFAR10 dataset⁶, which we denote as Small-VGG16, and the LeNet-300-100 and LeNet5 trained on MNIST.
In addition, we also applied DeepCABAC to pre-sparsified versions of these networks. For that, we applied the variational sparsification algorithm [26] to all networks, except for the VGG16 and ResNet50 due to the high computational complexity demanded by the method. The advantage of employing [26] is that we obtain the variances of each weight parameter as a byproduct of the method's output, thus being able to directly apply DC-v1 after the sparsification process has finished. In the cases of the VGG16 and ResNet50 networks, we firstly applied the iterative sparsification algorithm [30] to them and subsequently estimated their FIM-diagonals by minimizing the same variational objective proposed in [26] (see appendix for a more comprehensive explanation).

We compare the two versions of DeepCABAC, DC-v1 & DC-v2, against two previously proposed quantization schemes. Namely, similarly to [38], [45], [46], we applied the nearest-neighbor quantization scheme to the networks. In addition, we also applied the weighted Lloyd algorithm as proposed by [43], [45], [46]. As possible lossless compression candidates, we considered the scalar Huffman code, the code proposed by [38] which we denote CSR-Huffman, and the bzip2 [56] algorithm. See appendix for a more detailed explanation of the respective implementations.

Table I shows the results. As one can see, DeepCABAC is able to attain higher compression gains on most networks as compared to the previously proposed coders. It is able to compress the pretrained models by x18.9 and the sparsified models by x50.6 on average. In contrast, the Lloyd algorithm compresses the models by x13.6 and x47.3 on average, whereas uniform quantization only achieves x5.7 and x25.0 compression gains.

B. Assignment vs. quantization points
To recall, lossy quantization involves two types of mappings, the quantization map Q where input values are assigned to integers, and the reconstruction map Q −1 which assigns a quantization point to each integer. Hence, the following experiment aims to assess the effectiveness of these components individually.
For that, we selected a predefined set of step-sizes and subsequently quantized the parameters according to different quantization schemes. In this way we can gain insights into the compression gains attained only by the influence of the quantization map. Table II shows the average bit-sizes per parameter attained by applying different quantizers with the three given step-sizes to the Small-VGG16 model. In order to decouple the lossless part from the quantization, the bit-sizes are calculated with regard to the entropy of the empirical probability mass distribution (EPMD) of the quantized models in the case of the Lloyd and uniform algorithms, since it marks the theoretical minimum for lossless codes that do not take correlations between the parameters into account. In contrast, since DeepCABAC's quantizer is optimized explicitly under CABAC's lossless coder, we calculated the average bit-size with regard to the total bit-size of the model as output by CABAC. We also want to stress that we chose the networks that resulted in equal accuracies, that is, within the ±0.1 percentage point range from the accuracy attained after applying a uniform quantizer.
We gain several insights from table II. Firstly, notice how DeepCABAC's performance is very sensitive to the particular choice of the step-size. This is due to the fact that, usually, the best compression performances are attained for small compression strengths λ ≈ 0 at high accuracies. Thus, DeepCABAC's quantization map behaves similarly to uniform quantization in those cases, and therefore it becomes sensitive to the particular choice of the step-size. Nevertheless, notice how DeepCABAC still always attains better compression performance than uniform quantization.
Secondly, notice how for small step-sizes DC-v1 outperforms DC-v2 and thus, makes better rate-distortion decisions. We attribute this to the property that DC-v1 takes the "robustness" to perturbations of each element during quantization into account. As we have already discussed in sections II-D and III (and in more detail in the appendix), the latter can be interpreted as minimizing an approximation to the desired MDL-loss function. However, since it is only an approximation, this only applies for small step-sizes and becomes more inaccurate for larger ones. Indeed, table II shows how DC-v2 attains similar results as DC-v1 as the step-size is increased, implying that the particular expression of the RD-function becomes more and more irrelevant.
These insights motivated the design of DC-v2 in the first place, since it is able to explore a larger set of step-sizes for the best accuracy vs. bit-size trade-offs. Indeed, as table I from the previous experiment shows, DC-v2 attains similar or even higher compression gains than DC-v1, in particular in the case of pretrained networks.

C. Lossless coding
In our last experiment we aimed to assess the efficiency of different universal lossless coders. For that, we quantized the Small-VGG16 network using three different quantizers, and subsequently compressed each of them using different universal lossless coders. More concretely, we quantized the model by applying DC-v2, the weighted Lloyd algorithm and the nearest-neighbor quantizer. We then applied the scalar Huffman code, the CSR-Huffman code [38], the bzip2 algorithm, and CABAC. Moreover, we also calculated the entropy of the quantized networks, as measured with regard to their empirical probability mass distribution (EPMD). The resulting bit-sizes are in table III. As one can see, CABAC is able to attain higher compression gains across all quantized versions of the network. Moreover, in some cases it is able to attain lower code-lengths than the entropy of the EPMD. These results are attributed to CABAC's inherent capability to take correlations between the network's parameters into account. This property highlights its superiority as compared to the previously proposed universal lossless coders, e.g., scalar Huffman and CSR-Huffman, since their average code-lengths are bounded from below by the entropy and therefore it would be impossible for them to attain lower code-lengths than CABAC in those cases.
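That an adaptive, context-conditioned model can undercut the entropy of the EPMD is easy to reproduce on a toy binary sequence. The sketch below is an illustration only: it computes ideal code-lengths (−log2 of the predicted probability) instead of running an arithmetic coder, and its context choice (the previously coded bit) and count-based probability update are assumptions made for the example, not the context modelling used in DeepCABAC.

```python
import math

def zeroth_order_bits(bits):
    """Total code-length if every bit is coded with the fixed empirical distribution."""
    n = len(bits)
    ones = sum(bits)
    total = 0.0
    for count in (ones, n - ones):
        if count:
            total -= count * math.log2(count / n)
    return total

def adaptive_context_bits(bits):
    """Ideal code-length of an adaptive binary model conditioned on the previous bit."""
    counts = {0: [1, 1], 1: [1, 1]}      # per-context counts, initial estimate 0.5
    context = 0
    total = 0.0
    for b in bits:
        c = counts[context]
        probability = c[b] / (c[0] + c[1])
        total -= math.log2(probability)  # cost of coding b under the current estimate
        c[b] += 1                        # adapt the estimate to the local statistics
        context = b                      # the next bit is conditioned on this one
    return total

# Hypothetical flag sequence with long runs, i.e. strongly correlated neighbours.
sequence = ([0] * 20 + [1] * 20) * 10
epmd_bits = zeroth_order_bits(sequence)
context_bits = adaptive_context_bits(sequence)
```

The sequence contains as many zeros as ones, so any code that ignores the ordering needs 1 bit per symbol (400 bits in total), whereas the context-conditioned model needs only a fraction of that because each bit is almost always equal to its predecessor.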

VI. CONCLUSION
In this work we proposed a novel compression algorithm for deep neural networks called DeepCABAC, which is based on applying a Context-based Adaptive Binary Arithmetic Coder (CABAC) to the network's parameters. CABAC is the state-of-the-art universal lossless coder employed in the H.264/AVC and H.265/HEVC video coding standards. DeepCABAC also incorporates a novel quantization scheme that explicitly optimizes the accuracy vs. bit-size trade-off, without relying on expensive retraining or access to large amounts of data. Experiments showed that it can compress pretrained neural networks by x18.9 on average, and their sparsified versions by x50.6, consistently attaining higher compression performance than previously proposed coding techniques with similar characteristics. Moreover, DeepCABAC is able to capture correlations between the network's parameters, and is therefore able to compress them beyond the entropy limit of codes that only assume a stationary distribution.
As future work we will investigate the impact of compression on the neural network's problem-solving strategies [57] and apply DeepCABAC in distributed training scenarios, where the communication overhead of the network's update parameters is critical for the overall training efficiency.

±0.5 percentage point threshold. In both cases the number of clusters was doubled, then 20 experiments with λ in the range of 0.0 to 1.0 were conducted. This process was repeated until the accuracies were within the threshold⁷. Then, for all networks, two adjacent values of λ were selected for which the accuracy lies within the range and that produced the smallest entropies. Again, 20 experiments with λ values between these selected λs were conducted. This process was repeated until there were no longer any gains in the entropies. Typically, two rounds were enough to find no further improvement.

C. DeepCABAC's hyperparameter selection
In all experiments we set n = 10 for the AbsGr(n)Flags.

E. DC-v2
For the experiment in section V-A, we searched through the following set of hyperparameters:

[26] proposed a sparsification algorithm for neural networks that is based on the minimization of a variational objective. Concretely, they assume the improper log-scale uniform prior P(w) and a fully factorized Gaussian posterior P(w|µ, σ) over the weight parameters, and minimize the corresponding variational upper bound
$$(\mu, \sigma)^* = \arg\min_{(\mu,\sigma)} \; \mathbb{E}_{P(w|\mu,\sigma)}\left[\mathcal{L}(y, y')\right] + \beta\, D_{KL}\big(P(w|\mu,\sigma)\,\|\,P(w)\big) \qquad (13)$$
with y' ∼ P(y'|x, w) being the output samples of the neural network, (x, y) the data samples, (µ, σ) the means and standard deviations of all the network's parameters, and β ∈ R the Lagrange multiplier. As the KL-Divergence cannot be calculated analytically, they proposed to approximate it by
$$D_{KL}\big(P(w|\mu,\sigma)\,\|\,P(w)\big) \approx \sum_i \left[ k_1 - k_1\,\mathrm{sgm}(k_2 + k_3 \log \alpha_i) + \tfrac{1}{2} \log\left(1 + \alpha_i^{-1}\right) \right]$$
with sgm(·) being the sigmoid function, α_i = σ_i^2/µ_i^2 the inverse of the signal-to-noise ratio of the parameters, k_1 = 0.63576, k_2 = 1.87320 and k_3 = 1.48695. Then, they minimize (13) by applying scalable sampling techniques proposed by [58].

A. Connection between pruning and quantization
As a result of minimizing (13) we get a mean and a standard deviation for each parameter of the network. In our work, we interpreted the former as the parameter's (new) value and the latter as a measure of its "robustness" against perturbations. Indeed, the authors suggested pruning away (setting to 0) parameters with a signal-to-noise ratio under a given threshold. Concretely, they suggested the following pruning scheme

$$w_i \to 0, \quad \forall i : \alpha_i^{-1} < e^{-3}$$

where $w_i \equiv \mu_i$ now represents the mean value and thus $\alpha_i^{-1} = w_i^2/\sigma_i^2$. We can see that the scalar rate-distortion objective (10), which is minimized over the quantizer pair $(Q, Q^{-1})$, is a generalization of the above sparsification scheme. Namely, if we assume that the set of quantization points contains the same elements as the input set, $W = Q$ (thus, $Q^{-1} \equiv$ identity map), and consider a decoder that assumes a spike-and-slab distribution over the quantization points, then the above objective can be solved by applying the Lloyd algorithm. After convergence, it results in the following solution

$$Q^*(w_i) = 0 \quad \text{if} \quad F_i w_i^2 < \lambda (b + \log_2 p_0),$$

with $p_0$ being the empirical probability of the value 0 and $b$ the bit-precision for representing the non-zero values. Hence, if we now choose $F_i = 1/\sigma_i^2$ and an adequate $\lambda$, we get the suggested criterion as a special-case solution. This insight motivated our choice of FIM-diagonals in our experimental section.
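The special case stated above can be checked numerically. The sketch below compares the signal-to-noise pruning rule with the rate-distortion threshold rule for $F_i = 1/\sigma_i^2$, choosing $\lambda$ such that $\lambda (b + \log_2 p_0) = e^{-3}$. The function names and the values of `b` and `p0` are hypothetical and only serve the illustration; in practice $p_0$ would be the empirical frequency of zeros after convergence.

```python
import numpy as np

def snr_prune_mask(w, sigma, log_threshold=-3.0):
    """True where w_i^2 / sigma_i^2 < e^{-3}, i.e. where the parameter is pruned."""
    return (w ** 2) / (sigma ** 2) < np.exp(log_threshold)

def rd_prune_mask(w, F, lam, b, p0):
    """True where F_i * w_i^2 < lam * (b + log2(p0)), i.e. where Q*(w_i) = 0."""
    return F * w ** 2 < lam * (b + np.log2(p0))

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                  # toy mean values w_i = mu_i
sigma = rng.uniform(0.1, 10.0, size=1000)  # toy standard deviations

b, p0 = 8, 0.5                             # assumed bit-precision and zero probability
lam = np.exp(-3.0) / (b + np.log2(p0))     # "adequate" lambda for this b and p0

mask_snr = snr_prune_mask(w, sigma)
mask_rd = rd_prune_mask(w, 1.0 / sigma ** 2, lam, b, p0)
print(np.array_equal(mask_snr, mask_rd))
```

With any other choice of $F_i$ or $\lambda$ the two masks generally differ, which is the sense in which the rate-distortion rule generalizes the pruning scheme.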

B. Connection between variances, Hessian, and FIM-diagonals
Firstly, as thoroughly discussed in [59] and mentioned in [27], it is important to recall that the FIM is a positive semi-definite approximation of the Hessian of the loss function.
Hence, similar to [27], we can derive a more rigorous connection between the variances estimated by minimizing (13), the FIM-diagonals and the Hessian. Namely, assuming that the variational loss function can be approximated by its second-order expansion around the weight configuration $w$, we get the following expression

$$\mathbb{E}_{P(w|\mu, \sigma)}[L] \approx L(w) + \tfrac{1}{2}\,\mathrm{tr}[H\,\mathrm{diag}(\sigma^2)]$$

with $L(w)$ being the loss value at $w$, $H$ the Hessian of the loss at $w$, and $\mathrm{tr}[\cdot]$ the trace. Hence, if we substitute this expression into (13) and take the derivative with respect to $\sigma_i$, we obtain a relation between $\sigma_i$ and the Hessian that involves a factor $K(\alpha^{-1}) \in (0, 1)$, which is (approximately) a monotonically increasing function of the signal-to-noise ratio of the parameter. Hence, there is a direct connection between the variances, the signal-to-noise ratio, and the Hessian of the loss function and, consequently, with the FIM-diagonals.
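The second-order argument can be illustrated numerically. For a loss that is exactly quadratic, the expected loss under independent Gaussian perturbations with standard deviations $\sigma_i$ equals the loss at $w$ plus half the sum of the Hessian diagonals weighted by $\sigma_i^2$. The sketch below checks this by Monte Carlo sampling on a toy problem; the quadratic loss, the matrix `H` and all sizes are assumptions made for the illustration and are unrelated to the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
H = A @ A.T                       # symmetric positive semi-definite toy Hessian
w = rng.normal(size=n)            # weight configuration (posterior means)
sigma = rng.uniform(0.05, 0.5, size=n)

def loss(x):
    # Toy quadratic loss 0.5 * x^T H x; x has shape (..., n).
    return 0.5 * np.einsum("...i,ij,...j->...", x, H, x)

# Monte Carlo estimate of the expected loss under w + sigma * eps, eps ~ N(0, I).
samples = w + sigma * rng.normal(size=(200_000, n))
mc_expected = loss(samples).mean()

# Closed form: L(w) + 0.5 * tr[H diag(sigma^2)].
closed_form = loss(w) + 0.5 * np.sum(np.diag(H) * sigma ** 2)

print(float(mc_expected), float(closed_form))
```

Only the diagonal of `H` enters the correction term, which is why the diagonal of the Hessian (or of the FIM as its approximation) is the quantity that interacts with the per-parameter variances.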

C. Hessian-weighted vs. variance-weighted quantization
[45] suggested a Hessian-weighted Lloyd algorithm for quantizing the neural network's parameters, where the diagonals of the empirical Hessian are taken as weights in the algorithm. As we have already discussed above, these coefficients are closely connected to the FIM-diagonals, and are thus also theoretically well motivated. However, we found the algorithm to be less stable in practice when we used the Hessian-diagonals instead of the variances. Figure 8 shows the rate-accuracy curves obtained when quantizing the LeNet5 model with both alternatives. As we can see, the curves of the variance-weighted variant are more stable, and it even achieves better compression results than the Hessian-weighted variant.
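To make the role of the weights concrete, the following is a minimal sketch of a weighted, entropy-regularised Lloyd iteration on scalar parameters. It is not the algorithm of [45] nor the DeepCABAC quantizer; the function name `weighted_lloyd`, the quantile-based initialisation and the fixed iteration count are assumptions of this sketch. The weights `F` can be filled with Hessian diagonals or with inverse variances $1/\sigma_i^2$, and are assumed to be strictly positive.

```python
import numpy as np

def weighted_lloyd(w, F, n_clusters=8, lam=0.0, n_iter=50):
    """1-D weighted, entropy-regularised Lloyd iteration (illustrative sketch).

    w: parameter values, F: strictly positive per-parameter weights,
    lam: trade-off between weighted distortion and code length.
    """
    centers = np.quantile(w, np.linspace(0.0, 1.0, n_clusters))
    probs = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # Cost of assigning w_i to center k: weighted distortion + lam * code length.
        cost = (F[:, None] * (w[:, None] - centers[None, :]) ** 2
                - lam * np.log2(probs)[None, :])
        assign = np.argmin(cost, axis=1)
        for k in range(n_clusters):
            members = assign == k
            if members.any():
                # Weighted centroid minimises sum_i F_i (w_i - c)^2 over the cluster.
                centers[k] = np.sum(F[members] * w[members]) / np.sum(F[members])
        counts = np.bincount(assign, minlength=n_clusters)
        probs = np.maximum(counts / len(w), 1e-12)  # avoid log2(0) for empty clusters
    return centers, assign
```

With `lam=0` the weights only affect the centroid update; with `lam>0` they also change the assignments, since parameters with a small weight are moved more readily to a cheaper (more probable) cluster.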