Compact and Computationally Efficient Representation of Deep Neural Networks

At the core of any inference procedure in deep neural networks are dot product operations, which are the component that requires the highest computational resources. For instance, deep neural networks such as VGG-16 require up to 15 giga-operations in order to perform the dot products present in a single forward pass, which results in significant energy consumption and thus limits their use in resource-limited environments, e.g., on embedded devices or smartphones. One common approach to reducing the complexity of the inference is to prune and quantize the weight matrices of the neural network. Usually, this results in matrices whose entropy values are low, as measured relative to the empirical probability mass distribution of their elements. In order to efficiently exploit such matrices, one usually relies on, inter alia, sparse matrix representations. However, most of these common matrix storage formats make strong statistical assumptions about the distribution of the elements, and therefore cannot efficiently represent the entire set of matrices that exhibit low-entropy statistics (thus, the entire set of compressed neural network weight matrices). In this paper, we address this issue and present new efficient representations for matrices with low-entropy statistics. Like sparse matrix data structures, these formats exploit the statistical properties of the data in order to reduce the size and execution complexity. Moreover, we show that the proposed data structures can not only be regarded as a generalization of sparse formats, but are also more energy and time efficient under practically relevant assumptions. Finally, we test the storage requirements and execution performance of the proposed formats on compressed neural networks and compare them to dense and sparse representations.
We experimentally show that we are able to attain up to $\times 42$ compression ratios, $\times 5$ speed ups, and $\times 90$ energy savings when we losslessly convert state-of-the-art networks such as AlexNet, VGG-16, ResNet152, and DenseNet into the new data structures and benchmark their respective dot products.


I. INTRODUCTION
The dot product operation between matrices constitutes one of the core operations in almost any field in science.
Examples are the computation of approximate solutions of complex system behaviors in physics [1], iterative solvers in mathematics [2] and features in computer vision applications [3]. Deep neural networks also rely heavily on dot product operations in their inference [4], [5]; e.g., networks such as VGG-16 require up to 16 dot product operations, which results in 15 giga-operations for a single forward pass. Hence, lowering the algorithmic complexity of these operations and thus increasing their efficiency is of major interest for many modern applications. Since the complexity depends on the data structure used for representing the elements of the matrices, a great amount of research has focused on designing data structures and respective algorithms that can perform efficient dot product operations [6]-[8].
Of particular interest are the so-called sparse matrices, a special type of matrix with the property that many of its elements are zero valued. In principle, one can design efficient representations of sparse matrices by leveraging the prior assumption that most of their element values are zero and therefore only storing the non-zero entries of the matrix. Consequently, their storage requirements become of the order of the number of non-zero values. However, having an efficient representation with regard to storage requirements does not imply that the dot product algorithm associated with that data structure will also be efficient. Hence, a great part of the research has focused on the design of data structures that also admit dot product algorithms of low complexity [8]-[11]. However, by assuming sparsity alone we implicitly impose a spike-and-slab prior over the probability mass distribution of the elements of the matrix. If the actual distribution of the elements differs greatly from this assumption, then the data structures devised for sparse matrices become inefficient. Hence, sparsity can be too constrained an assumption for some applications of current interest, e.g., the representation of quantized neural networks.
In this work, we alleviate the shortcomings of sparse representations by considering a more relaxed prior over the distribution of the matrix elements. More precisely, we assume that the empirical probability mass distribution of the matrix elements has a low entropy value as defined by Shannon [12]. Mathematically, sparsity can be considered a subclass of the general family of low-entropy distributions. In fact, sparsity measures the min-entropy of the element distribution, which is related to Shannon's entropy measure through Renyi's generalized entropy definition [13]. With this goal in mind, we ask the question: "Can we devise efficient data structures under the implicit assumption that the entropy of the distribution of the matrix elements is low?" We want to stress once more that by efficiency we regard two related but distinct aspects: 1) efficiency with regard to storage requirements and 2) efficiency with regard to the algorithmic complexity of the dot product associated with the representation. For the latter, we focus on the number of elementary operations required by the algorithm, since they are related to its energy and time complexity. It is well known that the minimal bit-length of a data representation is bounded by the entropy of its distribution [12]. Hence, matrices with low-entropy distributions automatically imply that we can design data structures that do not require high storage resources. In addition, as we will discuss in the next sections, low-entropy distributions also attain gains in efficiency if these data structures implicitly encode the distributive law of multiplication. By doing so, a great part of the algorithmic complexity of the dot product is reduced to the order of the number of shared weights per row in a matrix. This number is related to the entropy in that it is small as long as the entropy of the matrix is low. Therefore, these data structures not only attain higher compression gains, but also require fewer operations in total when performing the dot product.
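As a small side illustration of the relation just described (not part of the paper's formal treatment), the following sketch computes Shannon's entropy and the min-entropy of a probability mass distribution; for a spike-and-slab-like distribution, the min-entropy that sparsity measures lower-bounds the Shannon entropy:

```python
import math

def shannon_entropy(probs):
    # H = -sum_k p_k * log2(p_k)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    # H_inf = -log2(max_k p_k); sparsity fixes the min-entropy, since the
    # most frequent element of a sparse matrix is 0 with mass p_0.
    return -math.log2(max(probs))

# A sparse, spike-and-slab-like distribution: 90% zeros, rest uniform.
probs = [0.9] + [0.1 / 7] * 7
```

The bound `min_entropy(probs) <= shannon_entropy(probs)` holds for any distribution, which is the sense in which sparsity is the more constrained assumption.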
Our contributions can be summarized as follows:
• We propose new highly efficient data structures that exploit the prior that the matrix has a low number of shared weights per row (i.e., low entropy).
• We provide a detailed analysis of the storage requirements and the algorithmic complexity of performing the dot product associated with these data structures.
• We establish a relation between the known sparse and the proposed data structures. Namely, sparse matrices belong to the same family of low-entropy distributions; however, they can be considered a more constrained subclass of it.
• We show through experiments that these data structures indeed attain gains in efficiency on simulated as well as real-world data. In particular, we show that up to x42 compression ratios, x5 speed ups and x90 energy savings can be achieved when we benchmark the compressed weight matrices of state-of-the-art neural networks on the matrix-vector multiplication.
In the following Section II we introduce the problem of efficient representation of neural networks and briefly review the related literature. In Section III the proposed data structures are presented. We demonstrate through a simple example that these data structures are able to: 1) achieve higher compression ratios than their respective dense and sparse counterparts and 2) reduce the algorithmic complexity of performing the dot product. Section IV analyses the storage and energy complexity of these novel data structures. Experimental evaluation is performed in Section V using simulations as well as state-of-the-art neural networks such as AlexNet, VGG-16, ResNet152 and DenseNet. Section VI concludes the paper with a discussion.
In their most basic form, deep neural networks constitute a chain of affine transformations, each concatenated with a non-linear function that is applied element-wise to the output. Hence, the goal is to learn the values of those transformation (or weight) matrices, i.e., the parameters, such that the neural network performs its task particularly well. The procedure of calculating the output prediction of the network for a particular input is called inference. The computational cost of performing inference is dominated by computing the affine transformations (thus, the dot products between matrices). Since today's neural networks perform many dot product operations between large matrices, this greatly complicates their deployment onto resource-constrained devices.
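The chain of affine transformations plus element-wise non-linearity described above can be sketched in a few lines; the ReLU non-linearity and all names here are illustrative assumptions, not the paper's notation:

```python
def relu_layer(W, b, x):
    # One affine transformation W x + b followed by an element-wise
    # non-linearity (here: ReLU). Inference chains such layers; the dot
    # products inside the affine part dominate the computational cost.
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi
           for row, bi in zip(W, b)]
    return [max(0.0, p) for p in pre]
```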
However, it has been extensively shown that most neural networks are overparameterized, meaning that they contain many more parameters than actually needed for the tasks of interest [24]-[27]. This implies that these networks are highly inefficient with regard to the resources they require when performing inference. This fact motivated an entire research field of model compression [28]. One of the suggested approaches is to: 1) compress the weight elements of the neural network without (considerably) affecting its prediction accuracy and 2) convert the resulting weights into a representation that achieves high compression ratios and is able to execute the dot product operation efficiently. Whilst there has been a plethora of work focusing on the first step [26], [27], [29]-[39], previous literature has not focused as much on the second. As a consequence, most of the research has focused on developing techniques that either sparsify the network's weights [27], [29]-[31] or reduce the cardinality of the weight elements [32]-[34], since then sparse matrix representations or dense matrices with compressed numerical representations can be employed in order to perform inference efficiently.
However, this greatly reduces the possible efficiency gains that can be achieved. In fact, the highest reported compression gains are attained with techniques that either implicitly [26], [38] or explicitly [35]-[37], [39] attempt to reduce the entropy of the weight matrices of the network. To recall, throughout this work we consider the entropy of the empirical probability mass distribution of the weight elements. That is, we first identify the set of unique elements that appear in the matrix, denoted as Ω. Then, for each element ω_k ∈ Ω, we count its frequency of appearance and divide it by the total number of elements in the matrix, resulting in the probability mass value p_k = #(ω_k)/N, where #(·) is the counting operator and N the total number of elements in the matrix. Finally, we calculate Shannon's entropy H = -Σ_k p_k log_2 p_k. However, with no other means for representing the resulting compressed weight matrices, the achievable efficiency gains are bounded by the limitations of the sparse or dense representations.
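The entropy computation just described can be written out directly; `empirical_entropy` is an illustrative name, not the paper's:

```python
import math
from collections import Counter

def empirical_entropy(matrix):
    # Identify the unique elements, estimate p_k = #(w_k)/N and return
    # Shannon's entropy H = -sum_k p_k * log2(p_k).
    flat = [v for row in matrix for v in row]
    N = len(flat)
    return -sum((c / N) * math.log2(c / N)
                for c in Counter(flat).values())
```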
For instance, Figure 1 demonstrates the discrepancy between the sparsity assumption and the real distribution of weight elements. It plots the distribution of the weight elements of the last classification layer of VGG-16 [40] (a 1000 × 4096 dimensional matrix) after uniform quantization has been applied to the weight elements. We stress that the prediction accuracy and generalization of the network were not affected by this operation. On the one hand, as we can see, the distribution of the compressed layer does not satisfy the sparsity assumption, i.e., there is no single element (such as 0) that appears especially frequently in the matrix. The most frequent value is -0.008 and its frequency of appearance does not dominate over the others (about 4.2%). On the other hand, naively compressing the numerical values of the matrix elements down to a trivial 7-bit representation would also result in an inefficient representation. Since the activation values are still represented as single precision floating point values², the respective dot product algorithm would require multiple, mostly expensive decoding operations in order to convert each element of the weight matrix back into its original 32-bit floating point value.
Hence, neither sparse matrix representations nor the (compressed) dense representations can efficiently exploit the statistical properties of the weight matrix.
In this work, we overcome these limitations and present new matrix representations that become more efficient as the entropy of the weight matrices is reduced. In particular, their complexity depends partially on the number of shared weights present in the matrix, which shrinks as the entropy of the matrix is reduced. Indeed, we notice that for the matrix in Figure 1 most of the entries are dominated by only 15 distinct values, which is 1.5% of the number of columns of the matrix. In the next section we describe with a simple example how these new representations leverage this property in order to achieve both high compression ratios and efficient dot products.

III. DATA STRUCTURES FOR MATRICES WITH LOW ENTROPY STATISTICS
In this section we introduce the proposed data structures and show that they implicitly encode the distributive law. Consider the following matrix

M =
[0 3 0 2 4 0 0 2 3 4 0 4]
[4 4 0 0 0 4 0 0 4 4 0 4]
[4 0 3 4 0 0 0 4 0 2 0 0]
[0 0 0 4 4 4 0 3 4 4 0 0]
[0 4 4 0 0 4 0 4 0 0 0 0]

Now assume that we want to: 1) store this matrix with the minimum amount of bits and 2) perform the dot product with a vector a ∈ R^12 with the minimum complexity.

A. Minimum storage
We firstly comment on the storage requirements of the dense and sparse formats and then introduce two new formats.

² In this case, compressing the activation values down to a 7-bit representation would have significantly harmed the prediction accuracy of the network.

Dense format: Arguably the simplest way to store the matrix M is in its so-called dense representation. That is, we store its elements in a 5 × 12 long array (in addition to its dimensions m = 5 and n = 12).
Sparse format: However, notice that more than 50% of the entries are 0. Hence, we may attain a more compressed representation of this matrix if we store it in one of the well-known sparse data structures, for instance, the Compressed Sparse Row (or CSR, in short) format. This particular format stores the values of the matrix in the following way:
• Scan the non-zero elements in row-major order (that is, from left to right, up to down) and store them in an array (which we denote as W).
• Store the respective column indices of the non-zero elements in an array colI.
• Store pointers that signal the start of each row in an array rowPtr = [0, 7, 13, 18, 24, 28].
If we assume the same bit-size per element for all arrays, then the CSR data structure does not attain higher compression gains in spite of not storing the zero-valued elements (62 entries vs. the 60 that are required by the dense data structure).
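As a sketch (assuming the example matrix M of this section is reconstructed correctly from the arrays given in the text), the CSR scan can be implemented as follows; the 62-entry count matches the comparison above:

```python
def to_csr(matrix):
    # Scan non-zero elements in row-major order: W holds their values,
    # colI their column indices, rowPtr the offset where each row starts.
    W, colI, rowPtr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                W.append(v)
                colI.append(j)
        rowPtr.append(len(W))
    return W, colI, rowPtr

# Example matrix M of Section III (5 x 12, entries in {0, 2, 3, 4}).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]
W, colI, rowPtr = to_csr(M)
```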
We can improve this by exploiting the low-entropy property of matrix M .In the following, we propose two new formats which realize this.
Compressed Entropy Row (CER) format: Firstly, notice that many elements in M share the same value. In fact, only the four values Ω = {0, 4, 3, 2} appear in the entire matrix. Hence, it appears reasonable to assume that data structures that repeatedly store these values (such as the dense or CSR structures) induce high redundancies in their representation. Therefore, we propose a data structure where we only store those values once. Secondly, notice that different elements appear more frequently than others, and their relative order does not change throughout the rows of the matrix. Concretely, we have a set of unique elements Ω = {0, 4, 3, 2} which appear P# = {32, 21, 4, 3} times, respectively, in the matrix, and we obtain the same relative order of most to least frequent value throughout the rows of the matrix. Hence, we can design an efficient data structure which leverages both properties in the following way:
1) Store the unique elements present in the matrix in an array in frequency-major order (that is, from most to least frequent). We name this array Ω.
2) Store the respective column indices in row-major order, excluding those of the first element (thus, of the most frequent element). We denote this array colI.
3) Store pointers that signal where the positions of the next element in Ω start. We name this array ΩPtr. If a particular pointer in ΩPtr is the same as the previous one, the current element is not present in that row and we jump to the next element.
4) Store pointers that signal when a new row starts. We name this array rowPtr. Here, rowPtr points to entries in ΩPtr.
Hence, this new data structure represents matrix M as
Ω : [0, 4, 3, 2]
colI : [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9, 3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
ΩPtr : [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr : [0, 3, 4, 7, 9, 10]
Notice that we can uniquely reconstruct M from this data structure. We refer to it as the Compressed Entropy Row (or CER, in short) data structure. One can verify that the CER data structure only requires 49 entries (instead of 60 or 62), attaining as such a compressed representation of the matrix M.
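The unique reconstruction of M from the CER arrays can be sketched as follows (function and variable names are ours; the repeated-pointer rule of step 3 is handled implicitly, since a repeated pointer produces an empty colI slice):

```python
def cer_to_dense(Omega, colI, OmegaPtr, rowPtr, n):
    # Omega[0] is the (implicit) most frequent element and fills the row;
    # the remaining elements, in frequency-major order, each own one
    # colI segment per row.
    dense = []
    for r in range(len(rowPtr) - 1):
        row = [Omega[0]] * n
        segs = OmegaPtr[rowPtr[r]:rowPtr[r + 1] + 1]
        for k, (start, end) in enumerate(zip(segs, segs[1:]), start=1):
            for j in colI[start:end]:
                row[j] = Omega[k]
        dense.append(row)
    return dense

Omega = [0, 4, 3, 2]
colI = [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9,
        3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
OmegaPtr = [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr = [0, 3, 4, 7, 9, 10]
M = cer_to_dense(Omega, colI, OmegaPtr, rowPtr, 12)
```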
To summarize, the CER representation is able to attain higher compression gains because it leverages the following two properties: 1) many matrix elements share the same value and 2) the empirical probability mass distribution of the shared weight elements does not change significantly across rows.
Compressed Shared Elements Row (CSER) format: In some cases, it may well be that the probability distributions across the rows of a matrix are not similar to each other. Then the second assumption behind the CER data structure would not apply and we would only be left with the first one. That is, we only know that not many distinct elements appear per row in the matrix or, in other words, that many elements share the same value. The Compressed Shared Elements Row (or CSER, in short) data structure is a slight extension of the CER representation. Here, we add an element pointer array which signals which element in Ω the colI indices refer to. We call it ΩI. Thus, ΩI points to entries in Ω, ΩPtr to entries in colI, and rowPtr to entries in ΩPtr. The above matrix would then be represented as follows
Ω : [0, 2, 3, 4]
colI : [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9, 3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
ΩI : [3, 2, 1, 3, 3, 2, 1, 3, 2, 3]
ΩPtr : [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr : [0, 3, 4, 7, 9, 10]
Thus, for storing matrix M we require 59 entries, which is still a gain, but not a significant one. Notice that the ordering of the elements in Ω is now no longer important, as long as the ΩI array is adjusted accordingly. Similarly, the ordering of ΩI within each row can be arbitrary, as long as the ΩPtr and colI arrays are adjusted accordingly.
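Analogously, a minimal CSER decoder can be sketched; we assume, as before, that Ω[0] holds the implicit most frequent element:

```python
def cser_to_dense(Omega, colI, OmegaI, OmegaPtr, rowPtr, n):
    # Like CER, but OmegaI names the element of each colI segment
    # explicitly, so the ordering inside Omega is free.
    dense = []
    for r in range(len(rowPtr) - 1):
        row = [Omega[0]] * n
        for s in range(rowPtr[r], rowPtr[r + 1]):
            for j in colI[OmegaPtr[s]:OmegaPtr[s + 1]]:
                row[j] = Omega[OmegaI[s]]
        dense.append(row)
    return dense

Omega = [0, 2, 3, 4]
colI = [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9,
        3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
OmegaI = [3, 2, 1, 3, 3, 2, 1, 3, 2, 3]
OmegaPtr = [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr = [0, 3, 4, 7, 9, 10]
M = cser_to_dense(Omega, colI, OmegaI, OmegaPtr, rowPtr, 12)
```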
The relationship between CSER, CER and CSR data structures is described in Section IV.

B. Dot product complexity
We just saw that we can attain gains with regard to compression if we represent the matrix in the CER or CSER data structures. However, we can also devise corresponding dot product algorithms that are more efficient than their dense and sparse counterparts. As an example, consider only the scalar product between the second row of matrix M and an arbitrary input vector a = [a_1 a_2 . . . a_12]. In principle, the difference in algorithmic complexity arises because each data structure implicitly encodes a different expression of the scalar product, namely
dense : 4a_1 + 4a_2 + 0a_3 + 0a_4 + 0a_5 + 4a_6 + 0a_7 + 0a_8 + 4a_9 + 4a_10 + 0a_11 + 4a_12
CSR : 4a_1 + 4a_2 + 4a_6 + 4a_9 + 4a_10 + 4a_12
CER/CSER : 4(a_1 + a_2 + a_6 + a_9 + a_10 + a_12)
For instance, the dot product algorithm associated with the dense format would calculate the above scalar product by loading the entire row of M together with a and multiply-adding every pair of elements, regardless of their values.
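The three algebraic forms above can be checked numerically; the input vector below is an arbitrary choice for illustration, and the multiplication counts are noted per form:

```python
# Row 2 of the example matrix and an arbitrary numeric input vector.
row = [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4]
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

dense = sum(w * x for w, x in zip(row, a))           # 12 multiplications
csr = sum(w * x for w, x in zip(row, a) if w != 0)   # 6 multiplications
cer = 4 * sum(x for w, x in zip(row, a) if w == 4)   # 1 multiplication
```

All three yield the same value; only the factored CER/CSER form exploits the distributive law.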
In contrast, the dot product algorithm associated with the CSR representation would only multiply-add the non-zero entries. It does so by performing the following steps (using 0-indexed entries):
1) Load the subset of rowPtr corresponding to row 2. Thus, rowPtr → [7, 13].
2) Then, load the respective subset of non-zero elements and column indices. Thus, W → [4, 4, 4, 4, 4, 4] and colI → [0, 1, 5, 8, 9, 11].
3) Finally, load the subset of elements of a corresponding to the loaded column indices and subsequently multiply-add them with the loaded subset of W. Thus, a → [a_0, a_1, a_5, a_8, a_9, a_11] and calculate 4a_0 + 4a_1 + 4a_5 + 4a_8 + 4a_9 + 4a_11.
Executing this algorithm requires 20 load operations (2 from rowPtr and 6 each from W, colI and the input vector), 6 multiplications, 5 additions and 1 write. In total, this dot product algorithm requires 32 operations.
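The CSR steps above can be sketched as follows, using the full-matrix CSR arrays; note that one multiplication is still spent per non-zero entry:

```python
def csr_row_dot(W, colI, rowPtr, a, r):
    # CSR scalar product for row r: read the row extent from rowPtr,
    # then multiply-add only the stored non-zero entries.
    start, end = rowPtr[r], rowPtr[r + 1]
    return sum(W[s] * a[colI[s]] for s in range(start, end))

# CSR arrays of the example matrix M (row-major non-zeros).
W = [3, 2, 4, 2, 3, 4, 4,  4, 4, 4, 4, 4, 4,  4, 3, 4, 4, 2,
     4, 4, 4, 3, 4, 4,  4, 4, 4, 4]
colI = [1, 3, 4, 7, 8, 9, 11,  0, 1, 5, 8, 9, 11,  0, 2, 3, 7, 9,
        3, 4, 5, 7, 8, 9,  1, 2, 5, 7]
rowPtr = [0, 7, 13, 18, 24, 28]
a = [float(x) for x in range(1, 13)]
z = csr_row_dot(W, colI, rowPtr, a, 1)  # second row
```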
However, this dot product algorithm is still inefficient in this case, since we constantly multiply by the same element 4. Instead, the dot product algorithm associated with, e.g., the CER data structure would perform the following steps:
1) Load the subset of rowPtr corresponding to row 2. Thus, rowPtr → [3, 4].
2) Load the corresponding subset of ΩPtr. Thus, ΩPtr → [7, 13].
3) For each pair of elements in ΩPtr, load the respective subset of colI and the element in Ω. Thus, Ω → [4] and colI → [0, 1, 5, 8, 9, 11].
4) For each loaded subset of colI, sum the elements of a indexed by it. Thus, a → [a_0, a_1, a_5, a_8, a_9, a_11] and compute a_0 + a_1 + a_5 + a_8 + a_9 + a_11 = z.
5) Subsequently, multiply the sum by the respective element in Ω. Thus, compute 4z.
A similar algorithm can be devised for the CSER data structure. Both pseudocodes can be found in the appendix. This algorithm requires 17 load operations (2 from rowPtr, 2 from ΩPtr, 1 from Ω, 6 from colI and 6 from a), 1 multiplication, 5 additions and 1 write. In total, these are 24 operations. Hence, for the matrix M the CER (and CSER) data structure does not only achieve higher compression rates, but also attains gains in efficiency with respect to the dot product operation.
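A minimal sketch of the full CER dot product follows (assuming, as in Section IV, that the most frequent element ω_0 is 0, so its segments can be skipped entirely):

```python
def cer_dot(Omega, colI, OmegaPtr, rowPtr, a):
    # Per row: sum the input entries belonging to each shared element
    # first, then multiply once per distinct element (distributive law).
    out = []
    for r in range(len(rowPtr) - 1):
        acc = 0.0
        segs = OmegaPtr[rowPtr[r]:rowPtr[r + 1] + 1]
        for k, (start, end) in enumerate(zip(segs, segs[1:]), start=1):
            z = sum(a[j] for j in colI[start:end])
            acc += Omega[k] * z
        out.append(acc)
    return out

Omega = [0, 4, 3, 2]
colI = [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9,
        3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
OmegaPtr = [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr = [0, 3, 4, 7, 9, 10]
a = [float(x) for x in range(1, 13)]
y = cer_dot(Omega, colI, OmegaPtr, rowPtr, a)
```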
In the next section we give a detailed analysis of the storage requirements of these data structures and of the efficiency of their associated dot product algorithms. This will help us identify when one type of data structure attains higher gains than the others.

IV. AN ANALYSIS OF THE STORAGE AND ENERGY COMPLEXITY OF DATA STRUCTURES
Without loss of generality, in the following we assume that we aim to encode a particular matrix M ∈ Ω^{n×m}, with N = n·m elements, where each element M_ij = ω_k ∈ Ω takes a value from a finite set Ω = {ω_0, ω_1, ..., ω_{K−1}}. Moreover, we assign to each element ω_k a probability mass value p_k = #(ω_k)/N, where #(ω_k) counts the number of times the element ω_k appears in the matrix M. We denote the respective set of probability mass values P_Ω = {p_0, p_1, ..., p_{K−1}}. In addition, we assume that each element in Ω appears at least once in the matrix (thus, p_k > 0 for all k = 0, ..., K − 1) and that ω_0 = 0 is the most frequent value in the matrix. Finally, we order the elements in Ω and P_Ω in probability-major order, that is, p_0 ≥ p_1 ≥ · · · ≥ p_{K−1}.

A. Measuring the energy efficiency of the dot product
This work proposes representations that are efficient with regard to storage requirements as well as the algorithmic complexity of their dot product. For the latter, we focus on the energy requirements, since we consider energy the most relevant measure for neural network compression. However, exactly measuring the energy of an algorithm is unreliable, since it depends on the software implementation and on the hardware the program is running on. Therefore, we model the energy cost in a way that can easily be adapted across different software implementations as well as hardware architectures.
In the following we model a dot product algorithm by a computational graph whose nodes are labeled with one of four elementary operations, namely: 1) a mul or multiply operation, which takes two numbers as input and outputs their product, 2) a sum or summation operation, which takes two values as input and outputs their sum, 3) a read operation, which reads a particular number from memory, and 4) a write operation, which writes a value into memory. Note that we do not consider read/write operations from/into low-level memory (like caches and registers) that store temporary runtime values, e.g., outputs of summations and/or multiplications, since their cost can be attributed to those operations. Now, each of these nodes can be associated with an energy cost. The total energy required for a particular dot product algorithm then simply equals the total cost of the nodes in its graph.
However, the energy cost of each node depends on the hardware architecture and on the bit-size of the values involved in the operation. Hence, in order to make our model flexible with regard to different hardware architectures, we introduce four cost functions σ, µ, γ, δ : N → R, which take a bit-size as input and output the energy cost of performing the operation associated with them; σ is associated with the sum operation, µ with the mul, γ with the read and δ with the write operation.
Figure 2 shows the computational graph of a simple dot product algorithm for two 2-dimensional input vectors. This algorithm requires 4 read operations, 2 mul, 1 sum and 1 write. Assuming that the bit-size of all numbers is b ∈ N, the energy cost of this dot product algorithm is E = 1σ(b) + 2µ(b) + 4γ(b) + 1δ(b). Note that similar energy models have been proposed previously [41], [42]. In the experimental section we validate the model by comparing it to real energy measurements reported by previous authors.

Fig. 2: Computational graph of a scalar product algorithm X · Y = z for two 2-dimensional input vectors X, Y. Any such algorithm can be described in terms of four elementary operations (sum, mul, read, write). These elementary operations are associated with functions σ, µ, γ, δ, which take a bit-size b as input and output the energy (and/or time) cost of performing that operation. Hence, assuming that all elements have the same bit-size b, the total energy of the algorithm can be determined by summing these costs.
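Under this model, the cost of a plain scalar product of two n-dimensional vectors can be written as a small helper (an illustrative sketch; the operation counts generalize the n = 2 example of Fig. 2):

```python
def dot_energy(n, b, sigma, mu, gamma, delta):
    # Plain scalar product of two n-dimensional vectors:
    # 2n reads, n multiplications, n - 1 additions, one write.
    return (n - 1) * sigma(b) + n * mu(b) + 2 * n * gamma(b) + delta(b)

# With unit costs, the 2-dimensional example of Fig. 2 yields
# E = 1*sigma + 2*mu + 4*gamma + 1*delta = 8 elementary operations.
unit = lambda b: 1.0
```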
Considering this energy model, we can now provide a detailed analysis of the complexity of the CER and CSER data structures. However, we start with a brief analysis of the storage and energy requirements of the dense and sparse data structures in order to facilitate the comparison with them.

B. Efficiency analysis of the dense and CSR formats
The dense data structure stores the matrix in an N-long array (where N = m × n) using a constant bit-size b_Ω for each element. Therefore, its effective per-element storage requirement is b_Ω bits. The associated standard scalar product algorithm then has a constant per-element energy cost of γ(b_Ω) + γ(b_a) + µ(b) + σ(b), i.e., one read of the matrix element, one read of the input element, one multiplication and one addition, where b_a denotes the bit-size of the input elements. We can see that both the storage and the dot product cost are constant, irrespective of the distribution of the elements of the matrix.
In contrast, the CSR data structure requires only (1 − p_0)(b_Ω + b_I) + (m/N) b_I effective bits per element in order to represent the matrix, where b_I denotes the bit-size of the column indices. This comes from the fact that we need in total N(1 − p_0)b_Ω bits for representing the non-zero elements of the matrix, N(1 − p_0)b_I bits for their respective column indices and mb_I bits for the row pointers. Moreover, it requires approximately (1 − p_0)(γ(b_Ω) + γ(b_I) + γ(b_a) + µ(b) + σ(b)) units of energy per matrix element in order to perform the dot product. In contrast to the dense format, the efficiency of the CSR data structure increases as p_0 → 1, thus, as the number of zero elements increases. Moreover, if the matrix size is large enough, the storage requirement and the cost of performing a dot product become effectively 0 as p_0 → 1.
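The per-element storage count of CSR described above can be written out directly (a sketch with illustrative names):

```python
def csr_bits_per_element(p0, b_omega, b_index, m, n):
    # N(1 - p0) * b_omega bits for the non-zero values,
    # N(1 - p0) * b_index bits for their column indices,
    # m * b_index bits for the row pointers, divided by N = m * n.
    N = m * n
    return (1 - p0) * (b_omega + b_index) + (m * b_index) / N
```

As p_0 → 1 only the row-pointer overhead b_I/n remains, while for a dense matrix (p_0 = 0) CSR is strictly larger than the b_Ω bits per element of the dense format.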
For ease of the analysis, we introduce big-O notation for capturing terms that depend on the shape of the matrix. In addition, we denote by c_a the total effective cost of involving an element of the input vector in the dot product operation.
Analogously, c_Ω can be interpreted with regard to the elements of the matrix. Hence, we can rewrite the above equations (2) and (4) in terms of c_a and c_Ω.

C. Efficiency analysis of the CER and CSER formats

Following a similar reasoning as above, we can state the following theorem.

Theorem 1: Let M ∈ R^{m×n} be a matrix. Let further p_0 ∈ (0, 1) be the empirical probability mass of the zero element, and let b_I ∈ N be the bit-size of the numerical representation of a column or row index in the matrix. Then, the CER representation of M requires effective bits per matrix element, where k̄ denotes the average number of shared elements that appear per row (excluding the most frequent value), k̃ the average number of padded indices per row and N = m × n the total number of elements of the matrix. Moreover, the effective cost associated with the dot product with an input vector a ∈ R^n is per matrix element, where c_a and c_Ω are as in (5) and (6).

Analogously, we can state

Theorem 2: Let M, p_0, b_I, k̄, c_a, c_Ω be as in Theorem 1. Then, the CSER representation of M requires effective bits per matrix element, and the per-element cost associated with the dot product with an input vector a ∈ R^n is analogous.

The proofs of Theorems 1 and 2 are given in the appendix. These theorems state that the efficiency of the data structures depends on the (k̄, p_0) values (average number of distinct elements per row, sparsity) of the empirical distribution of the elements of the matrix. That is, these data structures are increasingly efficient for distributions that have high p_0 and low k̄ values. However, since the entropy measures the effective average number of distinct values that a random variable outputs, both values are intrinsically related to it. In fact, from Renyi's generalized entropy definition [13] we know that p_0 ≥ 2^{−H}. Moreover, the following properties are satisfied:
• k̄ → min{K − 1, n}, as H → log_2 K or n → ∞, and
• k̄ → 0, as H → 0.
Consequently, we can state the following corollary.

Corollary 2.1: For a fixed set size of unique elements |Ω| = K and constant index bit-size b_I, the storage requirements S as well as the cost of the dot product operation E of the CER and CSER representations satisfy, where p_0, b_I, n and N are as in Theorems 1 and 2 and H denotes the entropy of the matrix element distribution.

Thus, the efficiency of the CER and CSER data structures increases as the column size increases, or as the entropy decreases. Interestingly, when n → ∞ both representations converge to the same values and thus become equivalent. In addition, there will always exist a column size n for which both formats are more efficient than the original dense and sparse representations (see Fig. 5, where this trend is demonstrated experimentally).

D. Connection between CSR, CER and CSER
The CSR format is considered one of the most general sparse matrix representations, since it makes no further assumptions regarding the empirical distribution of the matrix elements. Consequently, it implicitly assumes a spike-and-slab distribution on them. However, spike-and-slab distributions are a particular class of low-entropy distributions (for sufficiently high sparsity levels p_0). In fact, spike-and-slab distributions have the highest entropy values among all distributions with the same sparsity level. In contrast, as a consequence of Corollary 2.1, the CER and CSER data structures relax this prior and can therefore efficiently represent the entire set of low-entropy distributions. Hence, the CSR data structure can be interpreted as a more specialized version of the CER and CSER representations.
This may be more evident in the following example: consider the 1st row of the matrix example from Section III. Its CSER representation is
Ω : [0, 2, 3, 4]
colI : [4, 9, 11, 1, 8, 3, 7]
ΩI : [3, 2, 1]
ΩPtr : [0, 3, 5, 7]
rowPtr : [0, 3]
In comparison, the CER representation assumes that the ordering of the elements in ΩI is similar for all rows and therefore directly omits this array, implicitly encoding this information in the Ω array. Therefore, the CER representation can be interpreted as a more explicit/specialized version of the CSER. The representation would then be
Ω : [0, 4, 3, 2]
colI : [4, 9, 11, 1, 8, 3, 7]
ΩPtr : [0, 3, 5, 7]
rowPtr : [0, 3]
Similarly, the CSR representation omits the ΩPtr array, since it assumes a uniform distribution over the non-zero elements (thus, over the W array), in which case all the increments in ΩPtr would redundantly be equal to 1. Therefore, the respective representation would be
W : [3, 2, 4, 2, 3, 4, 4]
colI : [1, 3, 4, 7, 8, 9, 11]
rowPtr : [0, 7]
Consequently, the CER and CSER representations will have superior performance for all those distributions that are not similar to spike-and-slab distributions. Figure 3 displays a sketch of the regions of the entropy-sparsity plane where we expect the different data structures to be most efficient. The sketch shows that the efficiency of sparse data structures is high on the subset of distributions that are close to the right border line of the (H, p_0)-plane, thus close to the family of spike-and-slab distributions. In contrast, dense representations are increasingly efficient for high-entropy distributions, hence in the upper-left region. The CER and CSER data structures cover the rest. Figure 4 confirms this trend experimentally.

V. EXPERIMENTS
We applied the dense, CSR, CER and CSER representations to simulated matrices as well as to quantized neural network weight matrices, and benchmarked their efficiency with regard to the following four criteria:
1) Storage requirements: We calculated the storage requirements according to equations (1), (3), (9) and (11).
2) Number of operations: We implemented the dot product algorithms associated with the four above data structures (pseudocode for the CER and CSER formats can be found in the appendix) and counted the number of elementary operations they require to perform a matrix-vector multiplication.
3) Time complexity: We timed each respective elementary operation and calculated the total time as the sum of those values.
4) Energy complexity: We estimated the respective energy cost by weighting each operation according to Table I. The total energy then results from the sum of those values. As for the IO operations (read/write operations), their energy cost depends on the size of the memory the values reside in. Therefore, we calculated the total size of the array in which a particular number is contained and chose the respective maximum energy value. For instance, if a particular column index is stored using a 16-bit representation and the total size of the column index array is 30 KB, then the respective read/write energy cost would be 5.0 pJ.
In addition, we used single-precision floating-point representations for the matrix elements and unsigned integer representations for the index and pointer arrays. For the latter, we compressed the index element values to their minimum required bit-sizes, restricting them to be either 8, 16 or 32 bits.
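The energy model above can be sketched as a weighted sum of operation counts. The per-operation costs below are hypothetical stand-ins for Table I (only the 5.0 pJ large-array read figure is taken from the example in the text); the dictionary keys and function name are ours.

```python
# Hypothetical per-operation energy costs in pJ, standing in for Table I.
# Only "read_large" (5.0 pJ, read from a ~30 KB array) comes from the text;
# the other values are illustrative placeholders.
ENERGY_PJ = {"add": 0.1, "mul": 0.4, "read_small": 1.0, "read_large": 5.0}

def energy_estimate(op_counts):
    """Total energy (pJ) from counts of elementary operations,
    weighting each operation type by its per-operation cost."""
    return sum(ENERGY_PJ[op] * n for op, n in op_counts.items())
```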
Notice that we do not consider the complexity of converting the dense representation into the different formats in our experiments. This is justified in the context of neural network compression, since this step can be applied prior to the inference procedure. That is, in most real-world scenarios one first converts the weight matrices, possibly with the help of a capable computer, and then deploys the converted neural network onto a resource-constrained device. We are mostly interested in the resource consumption that takes place on the device. Nevertheless, as a side note, the algorithmic complexity of conversion into the CSR, CER and CSER representations is O(N), that is, of the order of the number of elements in the matrix.

A. Experiments on simulated matrices
As a first set of experiments, we aimed to confirm the theoretical trends described in Section IV.
1) Efficiency on different regions of the entropy-sparsity plane: Firstly, we argued that each distribution has a particular entropy-sparsity value, and that the superiority of the different data structures manifests in different regions of that plane. Concretely, we expected the dense representation to be increasingly more efficient in the upper-left corner, the CSR on the bottom-right (and along the right border), and the CER and CSER on the rest. Figure 4 shows the result of one such experiment. In particular, we randomly selected a point-distribution on the (H, p0)-plane and sampled 10 different matrices from that distribution. Subsequently, we converted each matrix into the respective dense, CSR, CER and CSER representations and benchmarked the performance with regard to the four measures described above. We then averaged the results over these 10 matrices. Finally, we compared the performances with each other and color-coded the best result. That is, blue corresponds to points where the dense representation was the most efficient, green to the CSR, and red to either the CER or CSER. As one can see, the result closely matches the expected behavior.

[Fig. 4 caption: We compare the dense data structure (blue), the CSR format (green) and the proposed data structures (red). The colors indicate which of the three categories was the most efficient at that point in the plane. The proposed data structures tend to be more efficient in the lower-left region of the plane. In contrast, sparse data structures tend to be more efficient in the upper-right corner, and dense structures in the upper-left corner. For this experiment we employed a 100 × 100 matrix and calculated the average complexity over 10 matrix samples at each point. The size of the set of elements was 2^7.]
2) Efficiency as a function of the column size: As a second experiment, we studied the asymptotic behavior of the data structures as we increase the column size of the matrices. From Corollary 2.1 we expect the CER and CSER data structures to increase their efficiency as the number of columns in the matrix grows (thus, as n → ∞), until they converge to the same point, outperforming the dense and sparse data structures.

[Fig. 5 caption: The proposed data structures tend to be more efficient as the column dimension of the matrix increases, and converge to the same value for n → ∞.]
Figure 5 confirms this trend experimentally with regard to all four benchmarks. Here we chose a particular point-distribution on the (H, p0)-plane and fixed the number of rows. Concretely, we chose H = 4.0, p0 = 0.55 and m = 100 (the latter being the row dimension), and measured the average complexity of the data structures as we increased the number of columns n → ∞.
As a side note, the sharp changes in the plots are due to the discrete jumps in the values of Table I. For instance, the sharp drops in storage ratios come from changes of the index bit-sizes, e.g., from 8 → 16 bits.
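The bit-size selection that causes these jumps can be sketched as follows (function name ours): indices are stored in the smallest of the allowed 8/16/32-bit unsigned widths that can hold the largest index value.

```python
def index_bitsize(max_value):
    """Smallest of the allowed unsigned widths (8, 16 or 32 bits) that
    can represent max_value, as used for the index and pointer arrays."""
    for b in (8, 16, 32):
        if max_value < 2 ** b:
            return b
    raise ValueError("index does not fit in 32 bits")
```

As soon as the column dimension crosses a power-of-two boundary (e.g., 256), every stored index doubles in width, which produces the sharp drops in the storage-ratio plots.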

B. Compressed Neural Networks without Retraining
As a second set of experiments, we tested the efficiency of the proposed data structures on compressed deep neural networks. In particular, we benchmarked their weight matrices with regard to the matrix-vector operation, after compressing them using two different types of quantization techniques: one where retraining of the network is required (Section V-C) and one where it is not (Section V-B). We treat them separately, since the statistics of the resulting compressed weight matrices are conditioned by the quantization applied to them.
We start by analyzing the latter case. This scenario is of particular interest since it applies to cases where one does not have access to the training data (e.g., a federated learning scenario) or it is prohibitive to retrain the model (e.g., limited access to computational resources). Moreover, common matrix representations, such as the dense or CSR formats, may fail to efficiently exploit the statistics present in these compressed weight matrices (see Figure 1 and the discussion in Section II).
In our experiments we firstly quantized the elements of the weight matrices of the networks in a lossy manner, while ensuring that we negligibly impact their prediction accuracy. Similarly to [35], [36], we applied a uniform quantizer over the range of weight values at each layer and subsequently rounded the values to their nearest quantization point. That is, for each weight matrix W, we calculated the range of values [wmin, wmax] (with wmin being the lowest weight element value and wmax, analogously, the highest) and inserted K = 2^b equidistant points inside that range, whose values were stored in the array Ω. Then, we quantized each weight element in W to its closest neighbor relative to Ω and measured the validation accuracy of the quantized network. In our experiments, we did not see any significant impact on the accuracy for all b ≥ 7 (Table II). We chose the uniform quantizer because of its simplicity and high performance relative to other, more sophisticated quantizers such as entropy-constrained k-means algorithms [35], [36]. Finally, we losslessly converted the quantized weight matrices into the different data structures and tested their efficiency with regard to the four above-mentioned benchmark criteria.

[Fig. 6 caption: Storage requirements of a compressed DenseNet [44] after converting its weight matrices into the different data structures. The weights of the network layers were compressed down to 7 bits (resulting accuracy is 77.09%). The plots show the over-the-layers averaged result. Top chart: compression ratio relative to the dense representation. Bottom charts: contribution of the different parts of the data structures to the storage requirements. For the CER/CSER formats, most of the storage goes to the column indices.]
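The uniform quantizer described above can be sketched in a few lines of numpy (function name ours; for brevity this sketch omits a guard for the degenerate case wmin = wmax).

```python
import numpy as np

def uniform_quantize(W, b=7):
    """Quantize W onto K = 2**b equidistant points spanning
    [W.min(), W.max()], rounding each entry to its nearest point.
    Returns the quantized matrix and the codebook Omega."""
    K = 2 ** b
    w_min, w_max = float(W.min()), float(W.max())
    omega = np.linspace(w_min, w_max, K)                       # codebook
    idx = np.rint((W - w_min) / (w_max - w_min) * (K - 1)).astype(int)
    return omega[idx], omega
```

With b = 7 this reproduces the 128-point setting used in the experiments; smaller b makes the rounding error visible.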
1) Storage requirements: Table II shows the gains in storage requirements for different state-of-the-art neural networks when storing them in the CER or CSER formats. In particular, we achieve more than x2.5 savings on the DenseNet architecture, whereas the CSR data structure attains negligible gains. This is mainly attributed to the fact that the dense and sparse representations store the weight element values of these networks very inefficiently. This is also reflected in Fig. 6, where one can see that most of the storage requirements for the dense and CSR representations are spent on storing the elements of the weight matrices Ω. In contrast, most of the storage cost of the CER and CSER data structures comes from storing the column indices colI, which is much lower than the cost of the actual weight values.

[Fig. 7 caption: Number of operations required to perform a dot product in the different formats for the experimental setup described in Fig. 6 (DenseNet). The CER/CSER formats require fewer operations than the other formats, because 1) they do not need to perform as many multiplications and 2) they do not need to load as many matrix weight elements.]

[Table II caption: Storage gains of different state-of-the-art neural networks after their weight matrices have been compressed down to 7 bits and subsequently converted into the different data structures. The gains are relative to the original dense representation of the compressed weight matrices, and they show the over-the-layers aggregated results. The accuracy is measured on the validation set of the ImageNet classification task (in parentheses we show the accuracy of the uncompressed model).]
2) Number of operations: Table III shows the savings attained with regard to the number of elementary operations needed to perform a matrix-vector multiplication. As one can see, we save up to 40% of the operations if we use the CER/CSER data structures on the DenseNet architecture. This is mainly due to the fact that the dot product algorithms of the CER/CSER formats implicitly encode the distributive law of multiplication and consequently require far fewer multiplications. This is also reflected in Fig. 7, where one can see that the CER/CSER dot product algorithms mainly perform input load (In_load), column index load (colI_load) and addition (add) operations. Here, "others" refers to any other operation involved in the dot product, such as multiplications, weight loading, writing, etc. In contrast, the dense and CSR dot product algorithms require an additional, equal number of weight element load (Ω_load) and multiplication (mul) operations.

[Fig. 8 caption: Time cost of a dot product in the different formats for the experimental setup described in Fig. 6 (DenseNet). The CER/CSER formats save time, because 1) they do not need to perform as many multiplications and 2) they do not spend as much time loading the matrix weight elements.]

[Table III caption: Gains attained with regard to the number of operations, time and energy cost needed for performing a matrix-vector multiplication with the compressed weight matrices of different state-of-the-art neural networks. The experimental setting and table structure are the same as in Table II. The performance gains are relative to the original dense representation of the compressed weight matrices, and they show the over-the-layers aggregated results.]
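The distributive-law saving can be illustrated with a per-row sketch over CER-style arrays (function name ours): input entries that share the same weight value are summed first, so each distinct weight incurs a single multiplication instead of one per non-zero element.

```python
def row_dot_grouped(omega, colI, ptr, x):
    """Dot product of one CER/CSER-style row with x. Input values sharing
    a weight are summed first, so each distinct weight is multiplied only
    once -- the 'distributive law' the formats implicitly encode."""
    y = 0.0
    for k in range(len(ptr) - 1):                 # one group per shared value
        s = sum(x[j] for j in colI[ptr[k]:ptr[k + 1]])
        y += omega[k + 1] * s                     # omega[0] = 0 is skipped
    return y
```

For the example row from Section III, this performs 3 multiplications instead of the 7 that a dense or CSR dot product would need.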

3) Time cost: In addition, Table III shows that we attain speedups when performing the dot product in the new representations. Interestingly, Fig. 8 shows that most of the time is consumed by IO operations (that is, load operations). Consequently, the CER and CSER data structures attain speedups since they do not have to load as many weight elements. In addition, 20% and 16% of the time is spent performing multiplications in the dense and sparse representations, respectively. In contrast, this time cost is negligible for the CER and CSER representations.

[Fig. 9 caption: Energy cost of a dot product in the different formats for the experimental setup described in Fig. 6 (DenseNet). Performing loading operations consumes up to 3 orders of magnitude more energy than add and mul operations (see Table I). Since the CER/CSER formats need substantially fewer matrix weight element loading operations, they attain great energy savings compared to the dense and CSR formats.]
4) Energy cost: Similarly, we see that most of the energy consumption is due to IO operations (Fig. 9). Here the cost of loading an element may be up to 3 orders of magnitude higher than that of any other operation (see Table I); therefore, we obtain up to x6 energy savings when using the CER/CSER representations (see Table III).
Finally, Table IV and Fig. 10 further explain the observed gains. Namely, Table IV shows that the effective number of shared elements per row of the network is small relative to the network's effective column dimension. To clarify, we calculated the effective number of shared elements by: 1) calculating, for each row, the number of shared weights, 2) aggregating these numbers, and 3) dividing the result by the total number of rows in the network. Similarly, the effective number of columns indicates the average number of columns in the network, and the effective sparsity level as well as the effective entropy indicate the result averaged over the total number of weights. Fig. 10 shows the distributions of the different layers of the networks on the entropy-sparsity plane, where we see that most of them lie in the regions where we expect the CER/CSER formats to be more efficient.
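The three steps above can be sketched as follows (function name ours; here "shared elements" of a row is taken to mean its number of distinct non-zero values, in line with the CER/CSER grouping).

```python
def effective_shared_elements(rows):
    """Average number of distinct non-zero values per row, aggregated
    over all rows (of all layers) of a network."""
    counts = [len({v for v in row if v != 0}) for row in rows]  # step 1
    return sum(counts) / len(counts)                            # steps 2-3
```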
On a last side note, we would like to comment on further compressing the dense format itself. For instance, after quantization, we could trivially compress the weight element values down to a 7-bit representation, or apply more sophisticated entropy coders [35], [36]. Although these dense-format representations are able to attain relatively high compression ratios, they are inefficient with regard to the dot product algorithm, since additional decoding steps are required to convert the weight values back into their original floating-point representations. Recall that in this case the activation values would still be represented by single-precision floating-point values, and quantizing them down to 7 bits would significantly harm the prediction accuracy of the network. As an example, the matrix-vector product operation of the VGG-16 architecture slowed down by about 47% compared to the original dense representation after we converted each weight element down to its 7-bit representation.

C. Compressed Neural Networks with Retraining
In this section we benchmark the CER/CSER matrix representations on networks whose weight matrices have been compressed using quantization techniques where retraining is required in the process. This case is also of particular interest, since the highest compression gains can only be achieved by applying such quantization techniques to the network [26], [27], [37]-[39].

[Fig. 11 caption: Gains after converting the network's weight matrices into the different data structures and benchmarking their matrix-vector dot product operation. The network was compressed using the Deep Compression [26] technique. The plots show the over-the-layers aggregated results compared to the original dense data structure.]
For instance, Deep Compression [26] is a technique for compressing neural networks that attains high compression rates without incurring significant loss of accuracy. It does so by applying a three-stage pipeline: 1) prune unimportant connections by employing the algorithm of [31], 2) cluster the non-pruned weight values and refine the cluster centers with respect to the loss surface, and 3) employ an entropy coder for storing the final weights. Notice that the first two stages implicitly minimize the entropy of the weight matrices without incurring significant loss of accuracy, whereas the third stage losslessly converts the weight matrices into a low-bit representation. However, the proposed representation is based on the CSR format and, consequently, the complexity of the respective dot product algorithm remains of the same order. Concretely, the total number of operations that need to be performed is greater than or equal to that of the original CSR format. In fact, one requires specialized hardware in order to efficiently exploit this final neural network representation during inference [46]. Therefore, many authors benchmark the inference efficiency of highly compressed deep neural networks with regard to the standard CSR representation when testing on standard hardware such as CPUs and/or GPUs [26], [38], [41]. However, this comes at the cost of adding redundancies, since one then does not exploit step 2 of the compression pipeline.
In contrast, the CER/CSER representations become increasingly efficient as the entropy of the network is reduced, even if the sparsity level is maintained (see Figures 3 and 4). Hence, it is of high interest to benchmark their efficiency on highly compressed networks and compare them to their sparse (and dense) counterparts.
As a first experimental setup we chose the AlexNet architecture [45] as trained and quantized by the authors of [26], who were able to reduce the overall entropy of the network down to 0.89 without incurring any loss of accuracy. Figure 11 shows the gains in efficiency when the network layers are converted into the different data structures. We see that the proposed data structures surpass the dense and sparse data structures on all four benchmark criteria; hence, the CER/CSER data structures are much less redundant and more efficient representations of highly compressed neural network models. Interestingly, the CER/CSER data structures attain up to x14 storage and x20 energy savings, which is considerably higher than their sparse counterpart. Nevertheless, we do not attain significant time gains. This is due to the fact that, in our implementations, the time cost of loading the input elements was significantly higher than any other component in the algorithm (see Figure 14 in the appendix). This also explains why the CSR format shows speedups similar to the CER and CSER. However, this effect can be mitigated by applying further optimizations to the input vector, such as data reuse techniques and/or better storage management of its values during the dot product procedure; we would then also expect significant gains in time performance relative to the CSR format. We will consider this in future work.
Lastly, we trained and compressed additional architectures following a compression pipeline similar to the one described in [26]. Concretely, we: 1) pretrained the architectures until we reached state-of-the-art accuracies, 2) sparsified the architectures using the technique proposed in [27], 3) applied a uniform quantizer to the non-zero values in order to reduce their effective bit-size, and finally 4) converted the weight matrices into the different representations and benchmarked their efficiency with regard to the matrix-vector product operation. In step 2) we chose [27] since it is the current state-of-the-art sparsification technique. In our experiments we benchmarked the same architectures as reported in [27], [38]: an adapted version of the VGG network for the CIFAR-10 image classification task, and the fully connected and convolutional LeNet architectures for the MNIST classification task. The respective accuracies and compression gains can be seen in Table V, and the gains relative to the dot product complexity in Table VI. As we can see, we attain significantly higher gains in all four benchmarks when we convert the weight matrices into the CER/CSER representations. In particular, we attain up to x42 compression gains, x5 speedups and x90 energy gains on the VGG model.
As a last side note, we want to mention again that compressing the CSR representation further by, for instance, replacing the non-zero values with their respective quantization indices (as proposed in [26]) does not necessarily result in higher gains with regard to the dot product, since it requires an additional decoding step per non-zero element. For instance, we attained only x2.89 speedups on our compressed CIFAR10-VGG model, which is less than the speedup attained by the original CSR format (x3.63). Moreover, the CER/CSER representations still attained higher gains in all other complexity measures. Concretely, this further-compressed CSR attained x33.62, x3.10 and x62.32 gains in storage, number of operations and energy, respectively, which is still lower than the gains attained by the CER/CSER representations (Tables V and VI).

(Footnote 7: the CIFAR-10 VGG architecture follows http://torch.ch/blog/2015/07/30/cifar.html.)

[Table V caption: Storage gains of different neural networks after they have been compressed by the procedure described in Section V-C. The VGG model was trained on the CIFAR-10 data set and we used the same architecture as benchmarked in [27], [38]. The LeNet architectures were trained on the MNIST data set, and we likewise took the same versions as benchmarked in [27], [38]. The accuracy column (Acc) shows the accuracies of the compressed models, and in parentheses the accuracies of the pretrained models. Finally, the sparsity column (sp) displays the ratio between the non-zero weight values and the total number of weight elements.]
VI. CONCLUSION

We presented two new matrix representations, Compressed Entropy Row (CER) and Compressed Shared Elements Row (CSER), that attain high compression ratios and energy savings when the distribution of the matrix elements has low entropy. We showed in an extensive set of experiments that the CER/CSER data structures are more compact and computationally efficient representations of compressed state-of-the-art neural networks than dense and sparse formats. In particular, we attained up to x42 compression ratios and x90 energy savings by representing the weight matrices of a highly compressed VGG model in their CER/CSER forms and benchmarking the matrix-vector product operation.
By demonstrating the advantages of entropy-optimized data formats for representing neural networks, our work opens up new directions for future research, e.g., the exploration of entropy-constrained regularization and quantization techniques for compressing deep neural networks. The combination of entropy-constrained regularization and quantization with entropy-optimized data formats may push the limits of neural network compression even further, and may also be beneficial for applications such as federated or distributed learning [47], [48].
Future work will also study lossy compression schemes, especially in combination with their analysis using explanation methods [49], [50].

APPENDIX A
DETAILS ON NEURAL NETWORK EXPERIMENTS

A. Matrix preprocessing and convolutional layers
Before benchmarking the quantized weight matrices we applied the following preprocessing steps:
1) Matrix decomposition: After the quantization step it may well be that the value 0 is not included in the set of values and/or that it is not the most frequent value in the matrix. Therefore, we applied the following simple preprocessing step: assume a particular quantized matrix Wq ∈ R^(m×n), where each element (Wq)_ij ∈ Ω := {ω_0 = 0, ..., ω_(K−1)} belongs to a discrete set. Then, we decompose the matrix via the identity Wq = (Wq − ω_max 1) + ω_max 1 = Ŵ + ω_max 1, where 1 is the matrix whose elements are all equal to 1 and ω_max is the element that appears most frequently in the matrix. Consequently, Ŵ is a matrix with 0 as its most frequent element. Moreover, when performing the dot product with an input vector x ∈ R^n, we only incur the additional cost of adding the constant value c_out = ω_max Σ_i x_i to all elements of the output vector. The cost of this additional operation is effectively of the order of n additions and 1 multiplication for the entire dot product operation, which is negligible as long as the number of rows is sufficiently large.
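The decomposition above can be sketched as follows (function name ours): shift the matrix so that its most frequent element becomes 0, and correct the output with the constant c_out.

```python
import numpy as np

def decompose_and_dot(Wq, x):
    """Compute Wq @ x as W_hat @ x + c_out, where W_hat = Wq - omega_max
    (so 0 becomes the most frequent element of W_hat) and
    c_out = omega_max * sum(x) is added to every output entry."""
    vals, counts = np.unique(Wq, return_counts=True)
    omega_max = vals[np.argmax(counts)]    # most frequent element
    W_hat = Wq - omega_max                 # 0 is now the most frequent value
    c_out = omega_max * x.sum()            # n additions + 1 multiplication
    return W_hat @ x + c_out
```

The dense product here is only for checking correctness; in the benchmarks W_hat would be stored and multiplied in a sparse/CER/CSER format.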
2) Convolutional layers: A convolution operation can essentially be performed by a matrix-matrix dot product. The weight tensor containing the filter values is represented as an (F_n × (n_ch · m_F · n_F))-dimensional matrix, where F_n is the number of filters of the layer, n_ch the number of channels, and (m_F, n_F) the height/width of the filters. The convolution matrix then performs a dot product with an ((n_ch · m_F · n_F) × n_p)-dimensional matrix that contains the n_p patches of the input image as column vectors.
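The patch matrix just described can be built with an im2col-style sketch (function name ours; single channel, stride 1, no padding, for brevity).

```python
import numpy as np

def im2col(image, mF, nF):
    """Arrange all (mF x nF) patches of a single-channel image as the
    columns of a matrix, so that convolution becomes a matrix product."""
    H, W = image.shape
    cols = [image[i:i + mF, j:j + nF].reshape(-1)
            for i in range(H - mF + 1) for j in range(W - nF + 1)]
    return np.stack(cols, axis=1)          # shape (mF * nF, n_patches)
```

A filter reshaped to a (1, mF * nF) row vector multiplied with this matrix yields the convolution output for all n_p patches at once.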
Hence, in our experiments, we reshaped the weight tensors of the convolutional layers into their respective matrix forms and tested their storage requirements and dot product complexity by performing a simple matrix-vector dot product, weighting the results by the respective number of patches n_p that would have been used at each layer. For the dense algorithm, we implemented the standard three-loop nest algorithm.

[Algorithm 1 pseudocode fragment, garbled in extraction: loops over rowPtr and the weight index ranges (r_start to r_end), returning the output Y.]
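The dense baseline just mentioned can be sketched as the standard three-loop nest (function name ours):

```python
def dense_matmul(A, B):
    """Standard three-loop nest matrix-matrix product, the dense baseline."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):                     # rows of A
        for j in range(n):                 # columns of B
            for l in range(k):             # inner (reduction) dimension
                C[i][j] += A[i][l] * B[l][j]
    return C
```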

Fig. 1: Distribution of the weight matrix of the last layer of the VGG-16 neural network [40] after quantization. The respective matrix is 1000 × 4096 dimensional, transforming the 4096 last-layer features onto the 1000 output classes. We applied a uniform quantizer over the range of values, with 2^7 quantization points, which resulted in no loss of accuracy on the classification task. Left: probability mass distribution. Right: frequency of appearance of the 15 most frequent values.
where b_a denotes the bit-size of the elements of the input vector a ∈ R^n and b_o the bit-size of the elements of the output vector. The cost (2) is derived from considering 1) loading the elements of the input vector [γ(b_a)], 2) loading the elements of the matrix [γ(b_Ω)], 3) multiplying them [μ(b_o)], 4) summing the products [σ(b_o)], and 5) writing the result [δ(b_o)/n].
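The five contributions just listed can be collected into a single per-element cost expression; the following is our reconstruction of the formula the text refers to as (2), written term-by-term from items 1)-5):

```latex
\mathcal{C}_{\text{dense}}
  = \gamma(b_a) + \gamma(b_\Omega) + \mu(b_o) + \sigma(b_o) + \frac{\delta(b_o)}{n}
```

Here γ, μ, σ and δ denote the per-operation costs of loading, multiplying, adding and writing a value of the given bit-size, and the write cost is amortized over the n elements of a row.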

6 M R N M I U J 8 p S g w f W 4 K J Y v Z W R I Z Y Y W J s U E U b g r f 4 8 j J p V S u e W / H u L s q 1 m z y O A h z D C Z y B B 5 d Q g z o 0 o A k E F D z D K 7 w 5 T 8 6 L 8 + 5 8 z F t X n H z m C P 7 A + f w B a d + R J g = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " P v A y K c u l K r n J 9 H 8 7 O H E 3 1 g M A 6 c I = " > A A A B 7 n i c b V D L S g N B E O z 1 G e M r 6 t H L Y B A 8 h d 0 g 6 D H o R f A S w T w g W c L s Z J I 7 / 7 a 2 t b 2 x u b R d 2 i r t 7 + w e H p a P j p t W p Y b z B t N S m H V H L p V C 8 g Q I l b y e G 0 z i S v B W N b 2 d + 6 4 k b K 7 R 6 x E n C w 5 g O l R g I R t F J r a 7 U w 1 7 1 v l c q + x V / D r J K g p y U I U e 9 V / r q 9 j V L Y 6 6 Q S W p t J / A T D D N q U D D J p 8 V u a n l C 2 Z g O e c d R R W N u w 2 x + 7 p S c O 6 V P B t q 4 U k j m 6 u + J j M b W T u L I d c Y U R 3 b Z m 4 n / e Z 0 U B 9 d h J l S S I l d s s W i Q S o K a z H 4 n f W E 4 Q z l x h D I j 3 K 2 E j a i h D F 1 C R R d C s P z y K m l W K 4 F f C R 4 u y 7 W b P I 4 C n M I Z X E A A V 1 C D O 6 h D A x i M 4 R l e 4 c 1 L v B f v 3 f t Y t K 5 5 + c w J / I H 3 + Q P X G I 8 6 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " P v A y K c u l K r n J 9 H 8 7 O H E 3 1 g M A 6 c I = " > A A A B 7 n i c b V D L S g N B E O z 1 G e M r 6 t H L Y B A 8 h d 0 g 6 D H o R f A S w T w g W c L s Z J I 7 / 7 a 2 t b 2 x u b R d 2 i r t 7 + w e H p a P j p t W p Y b z B t N S m H V H L p V C 8 g Q I l b y e G 0 z i S v B W N b 2 d + 6 4 k b K 7 R 6 x E n C w 5 g O l R g I R t F J r a 7 U w 1 7 1 v l c q + x V / D r J K g p y U I U e 9 V / r q 9 j V L Y 6 6 Q S W p t J / A T D D N q U D D J p 8 V u a n l C 2 Z g O e c d R R W N u w 2 x + 7 p S c O 6 V P B t q 4 U k j m 6 u + J j M b W T u L I d c Y U R 3 b Z m 4 n / e Z 0 U B 9 d h J l S S I l d s s W i Q S o K a z H 4 n f W E 4 Q z l x h D I j 3 K 2 E j a i h D F 1 C R R d C s P z y K m l W K 4 F f C R 4 u y 7 W b P I 4 C n M I Z X E A A V 1 C D O 6 
h D A x i M 4 R l e 4 c 1 L v B f v 3 f t Y t K 5 5 + c w J / I H 3 + Q P X G I 8 6 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " P v A y K c u l K r n J 9 H 8 7 O H E 3 1 g M A 6 c I = " > A A A B 7 n i c b V D L S g N B E O z 1 G e M r 6 t H L Y B A 8 h d 0 g 6 D H o R f A S w T w g W c L s Z J I 7 / 7 a 2 t b 2 x u b R d 2 i r t 7 + w e H p a P j p t W p Y b z B t N S m H V H L p V C 8 g Q I l b y e G 0 z i S v B W N b 2 d + 6 4 k b K 7 R 6 x E n C w 5 g O l R g I R t F J r a 7 U w 1 7 1 v l c q + x V / D r J K g p y U I U e 9 V / r q 9 j V L Y 6 6 Q S W p t J / A T D D N q U D D J p 8 V u a n l C 2 Z g O e c d R R W N u w 2 x + 7 p S c O 6 V P B t q 4 U k j m 6 u + J j M b W T u L I d c Y U R 3 b Z m 4 n / e Z 0 U B 9 d h J l S S I l d s s W i Q S o K a z H 4 n f W E 4 Q z l x h D I j 3 K 2 E j a i h D F 1 C R R d C s P z y K m l W K 4 F f C R 4 u y 7 W b P I 4 C n M I Z X E A A V 1 C D O 6 h D A x i M 4 R l e 4 c 1 L v B f v 3 f t Y t K 5 5 + c w J / I H 3 + Q P X G I 8 6 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " P v A y K c u l K r n J 9 H 8 7 O H E 3 1 g M A 6 c I = " > A A A B 7 n i c b V D L S g N B E O z 1 G e M r 6 t H L Y B A 8 h d 0 g 6 D H o R f A S w T w g W c L s Z J I

Fig. 3: Sketch of the efficiency regions of the different data structures on the entropy-sparsity plane (H denotes the entropy and p_0 the sparsity). A point in the plane corresponds to a distribution of the elements of a matrix with the respective entropy-sparsity values. The intensity of the colors reflects the degree of efficiency of the representations. More intense red regions indicate that the CER/CSER data structures are more efficient; respectively, blue and green indicate the degree of efficiency of the dense and sparse data structures. Two lines constrain the set of possible distributions: the bottom line corresponds to distributions whose entropy equals their min-entropy (that is, where H = -log_2 p_0), and the second (rightmost) line to the family of spike-and-slab distributions.
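Placing a concrete matrix on this plane only requires its empirical element distribution. The following minimal NumPy sketch (the function name is ours, not from the paper) computes the entropy H and sparsity p_0 of a quantized weight matrix:

```python
import numpy as np

def entropy_and_sparsity(w):
    """Empirical entropy H (in bits) and sparsity p0 of a matrix's elements."""
    vals, counts = np.unique(w, return_counts=True)
    p = counts / counts.sum()
    H = -np.sum(p * np.log2(p))
    p0 = counts[vals == 0].sum() / counts.sum()  # empirical probability of 0
    return H, p0

# Toy pruned + quantized matrix: the value 0 dominates
w = np.array([[0, 0, 1, 0],
              [2, 0, 0, 0],
              [0, 1, 0, 0]])
H, p0 = entropy_and_sparsity(w)  # H ~ 1.04 bits, p0 = 0.75
```

For a pruned and quantized matrix the most frequent element is typically 0, so H is lower-bounded by the min-entropy -log_2 p_0, i.e., such matrices sit above the bottom line of Fig. 3.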

Fig. 4: The plots show the most efficient data structure at different points in the H-p_0 plane (to recall, H denotes the entropy of the matrix and p_0 the probability of the 0 value). We compare the dense data structure (blue), the CSR format (green), and the proposed data structures (red). The colors indicate which of the three categories was the most efficient at that point in the plane. The proposed data structures tend to be most efficient in the lower left region of the plane. In contrast, sparse data structures tend to be most efficient in the upper right corner, and dense structures in the upper left corner. For this experiment we employed a 100 × 100 matrix and calculated the average complexity over 10 matrix samples at each point. The size of the set of elements was 2^7.

Fig. 5: Efficiency ratios of the different data representations compared to the dense data structure. n denotes the column size. We chose a matrix with H = 4 and p_0 = 0.55 and a fixed row size of 100. The results show the values averaged over 20 matrix samples. The size of the set of elements was 2^7. The proposed data structures tend to become more efficient as the column dimension of the matrix increases, and the ratios converge to the same value for n → ∞.
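The convergence behavior in n can be made plausible with a back-of-the-envelope storage count for the standard CSR format (a rough sketch under assumed 32-bit values and indices; it does not model the paper's CER/CSER formats or their execution cost):

```python
def dense_bits(m, n, b=32):
    """Storage of an m x n dense matrix with b-bit entries."""
    return m * n * b

def csr_bits(m, n, p0, b=32, idx_bits=32):
    """Storage of the same matrix in CSR: values, column indices, row pointers."""
    nnz = round(m * n * (1 - p0))
    return nnz * (b + idx_bits) + (m + 1) * idx_bits

# Fixed row count 100 (as in Fig. 5) and growing column dimension n:
# the (m + 1) row-pointer overhead is amortized, and with equal value and
# index widths the ratio approaches 2 * (1 - p0) = 0.9 for p0 = 0.55.
for n in (10, 100, 1000):
    ratio = csr_bits(100, n, p0=0.55) / dense_bits(100, n)
```

The same amortization argument applies to the per-element overhead of the other formats, which is why the ratios stabilize as n grows.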

Fig. 10: Empirical distributions of the weight matrices of different neural network architectures after compression, displayed on the entropy-sparsity plane. As we see, most of the layers lie in the region where the CER/CSER data structures outperform the dense and sparse representations. The bottom and upper black lines constrain the set of possible distributions.

Fig. 11: Efficiency comparison of a compressed AlexNet [45] after converting its weight matrices into the different data structures and benchmarking their matrix-vector dot product operation. The network was compressed using the deep compression [26] technique. The plots show the results aggregated over all layers, relative to the original dense data structure.

B. More results from experiments
Figures 12, 13 and 14 show our results for the compressed ResNet152, VGG16 and AlexNet, respectively.

Fig. 12: Efficiency results of a compressed ResNet152. The experimental details are as described in Section V-B.

Fig. 13: Efficiency results of a compressed VGG16.

Fig. 14: Efficiency results of a compressed AlexNet. The experimental details are as described in Section V-C.

Algorithms 1 and 2 (dot product; partially recovered fragment):
13: w_end ← wPtr[w_idx]
for i = w_start; i < w_end do
16:    I ← colI[i]
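The surviving fragment walks the nonzeros between two pointers and looks up their column indices, as in a classic CSR matrix-vector product. For orientation, here is a plain CSR kernel (our own minimal sketch; the paper's Algorithms 1 and 2 operate on the richer CER/CSER index arrays):

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR form (values / column indices / row pointers)."""
    m = len(row_ptr) - 1
    y = np.zeros(m)
    for r in range(m):
        # row_ptr[r] and row_ptr[r + 1] bracket the nonzeros of row r,
        # analogous to w_start / w_end in the fragment above
        for i in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[i] * x[col_idx[i]]
    return y

# A = [[0, 2, 0],
#      [1, 0, 3]]
values, col_idx, row_ptr = [2.0, 1.0, 3.0], [1, 0, 2], [0, 1, 3]
y = csr_matvec(values, col_idx, row_ptr, np.ones(3))  # -> [2., 4.]
```

Note that every accessed value is read individually; the CER/CSER formats aim to avoid exactly this by sharing repeated values across a row.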

TABLE I: Energy values (in pJ) of different elementary operations for a 45nm CMOS process [43]. We set the 8-bit floating point operations to half the cost of a 16-bit operation, whereas we linearly interpolated the values in the case of the read and write operations.
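The two scaling rules in the caption can be written out directly. The pJ numbers below are placeholders (the actual 45nm values are given in [43] and not reproduced here); only the derivation rules for the 8-bit costs come from the text:

```python
# Placeholder 16- and 32-bit energies in pJ; real values come from [43].
cost16 = {"float_op": 1.1, "read": 5.0}
cost32 = {"float_op": 3.7, "read": 10.0}

# Rule 1: an 8-bit floating point operation costs half a 16-bit one.
cost8_float_op = 0.5 * cost16["float_op"]

# Rule 2: read/write energies are treated as linear in the bit width,
# fitted through the 16- and 32-bit points and evaluated at 8 bits.
slope = (cost32["read"] - cost16["read"]) / (32 - 16)
cost8_read = cost16["read"] + slope * (8 - 16)
```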

TABLE IV: Statistics of different neural network weight matrices, taken over the entire network. p_0 denotes the effective sparsity level of the network, H stands for the effective entropy, k represents the effective number of shared elements per row, and n denotes the effective column dimension. We see that all neural networks have relatively low entropy and thus a relatively low number of shared elements compared to their very high column dimensionality.

TABLE VI: Gains in the number of operations, time and energy cost attained when benchmarking the matrix-vector multiplication of the weight matrices of the networks described in Table II. The performance gains are relative to the original dense representation of the compressed weight matrices and display the results aggregated over all layers.