Arithmetic Coding-Based 5-bit Weight Encoding and Hardware Decoder for CNN Inference in Edge Devices

Convolutional neural networks (CNNs) have gained significant attention for real-world artificial intelligence (AI) applications such as image classification and object detection. However, for better accuracy, the size of CNN parameters (weights) has been increasing, which in turn makes it difficult to enable on-device CNN inference in resource-constrained edge devices. Though weight pruning and 5-bit quantization methods have shown promising results, it is still challenging to deploy large CNN models in edge devices. In this paper, we propose an encoding and hardware-based decoding technique which can be applied to 5-bit quantized weight data for on-device CNN inference in resource-constrained edge devices. Given 5-bit quantized weight data, we employ arithmetic coding with range scaling for lossless weight compression, which is performed offline. When executing on-device inferences with underlying CNN accelerators, our hardware decoder enables fast in-situ weight decompression with a small latency overhead. According to our evaluation results with five widely used CNN models, our arithmetic coding-based encoding method applied to 5-bit quantized weights improves the compression ratio by 9.6× while also reducing the memory data transfer energy consumption by 89.2%, on average, as compared to the case of uncompressed 32-bit floating-point weights. When applying our technique to pruned weights, we obtain better compression ratios of 57.5×–112.2× while reducing energy consumption by 98.3%–99.1% as compared to the case of 32-bit floating-point weights. In addition, by pipelining the weight decoding and transfer with the CNN execution, the latency overhead of our weight decoding with 16 decoding units (DUs) is only 0.16%–5.48% and 0.16%–0.91% for non-pruned and pruned weights, respectively. Moreover, our proposed technique with 4-DU decoder hardware reduces system-level energy consumption by 1.1%–9.3%.


I. INTRODUCTION
Recently, convolutional neural networks (CNNs) have been widely deployed in many artificial intelligence (AI) applications. Due to the advancements of processing units (e.g., central processing units (CPUs), graphics processing units (GPUs), neural processing units (NPUs), etc.), CNNs are lately being executed on-device. Furthermore, CNNs are recently being employed even in embedded, resource-constrained Internet-of-Things (IoT) devices for fast and efficient inferences in vision-based AI applications. However, the huge weight (parameter) size of CNNs is one of the major hurdles for deploying CNNs in resource-constrained edge devices. For example, ResNet-152 [1] and VGG-16 [2], which are very widely used CNN models, have weight sizes of 235MB and 540MB, respectively (in 32-bit floating-point precision). In addition, for improving the accuracy of CNN models, model and parameter sizes are expected to increase further. Though the large weight size causes several important challenges for CNN deployment in resource-constrained devices, one of the most serious problems is the limited memory (and/or storage size) of these devices. The large weight data often cannot be fully loaded into the small memory and storage of resource-constrained devices. In addition, the large weight size inevitably causes latency, power, and energy overheads when transferring the weight data between the storage/memory and the processing units, such as CPUs, GPUs, NPUs, or accelerators, which is not desirable for resource-constrained edge devices.
In order to resolve these problems, several well-known techniques such as weight pruning [3] and quantization [4] have been introduced. Weight pruning increases the sparsity of the weight data by replacing near-zero weight elements with zero-valued elements. Quantization reduces the size of weight elements with a negligible loss of CNN accuracy. For example, while the conventional precision of the elements in CNNs is single-precision floating-point (32-bit), 8-bit integer and 16-bit fixed-point precisions are also widely used for cost-efficient CNN inferences, which can reduce the weight size by 4× and 2×, respectively. Even further, as a more aggressive solution, several works have proposed to use 5-bit weight elements for deploying CNN models in resource-constrained systems [5] [6]. Though these works have shown successful results in reducing the weight data size, we can further reduce the weight size by applying data encoding schemes such as Huffman coding or arithmetic coding. By only storing the encoded (hence, reduced-size) weight data in the device's memory and/or storage, we can enable a more cost-efficient deployment of CNN models in resource-constrained devices.
In this paper, we introduce an arithmetic coding-based 5-bit quantized weight compression technique for on-device CNN inferences in resource-constrained edge devices. Once the weight elements are quantized to a 5-bit format (quantization can be done by other methods such as [6]), we leverage arithmetic coding for weight compression, which has been generally employed for entropy-based data compression (e.g., for image compression in [7] and [8]). In addition, we also employ range scaling for lossless compression, meaning that there is no accuracy loss in CNN inferences as compared to the case of using uncompressed 5-bit weight elements. Compared to Huffman coding-based compression, which is commonly used, our arithmetic coding-based technique leads to a better compression ratio, resulting in smaller memory and storage requirements for weights. In addition, when applying our technique to pruned weights, one can obtain a much higher compression ratio as compared to Huffman coding-based weight compression. For in-situ weight decompression in edge devices which contain a CNN accelerator or NPU, we propose a hardware decoder which can decompress the compressed weights with a small latency overhead.
We summarize our contributions as follows:
• We introduce a lossless arithmetic coding-based 5-bit quantized weight compression technique;
• We propose a hardware-based decoder for in-situ decompression of the compressed weights in the NPU or CNN accelerator, and also implement our hardware-based decoder in a field-programmable gate array (FPGA) as a proof-of-concept;
• Our proposed technique for 5-bit quantized weights reduces the weight size by 9.6× (by up to 112.2× in the case of pruned weights) as compared to the case of using the uncompressed 32-bit floating-point (FP32) weights;
• Our proposed technique for 5-bit quantized weights also reduces memory energy consumption by 89.2% (by up to 99.1% for pruned weights) as compared to the case of using the uncompressed FP32 weights;
• When combining our compression technique and hardware decoder (16 decoding units) with various state-of-the-art CNN accelerators [9] [10] [11], our technique incurs a small latency overhead of 0.16%-5.48% (0.16%-0.91% for pruned weights) as compared to the case without our proposed technique and hardware decoder;
• When combining our proposed technique with various state-of-the-art CNN accelerators [9] [10], our proposed technique with a 4-decoding-unit (DU) decoder reduces system-level energy consumption by 1.1%-9.3% as compared to the case without our proposed technique.

II. RELATED WORK
There have been many works on weight size reduction or compression. In [3], Han et al. have proposed weight pruning, which reduces the weight data size by increasing the sparsity of the weights. Another approach applies JPEG encoding to the weight data; in order to minimize accuracy losses from the JPEG encoding, it adaptively controls a quality factor according to error sensitivity and achieves a 42× compression ratio for a multilayer perceptron (MLP)-based network with an accuracy loss of 1%. In [14], Ge et al. have proposed a framework that reduces the weight data size by using approximation, quantization, pruning, and coding. For the coding method, the framework encodes only non-zero weights with their positional information. It is reported that the framework proposed in [14] shows compression ratios of 21.9× and 22.4× for AlexNet and VGG-16, respectively. In [15], Reagen et al. have proposed a lossy weight compression technique that exploits a Bloomier filter and arithmetic coding. Due to the probabilistic nature of the Bloomier filter, it also retrains weights based on the lossy compressed weights. It shows a compression ratio of 496× in the first fully connected layer of LeNet5. However, it also shows an accuracy loss of 4.4% in VGG-16 top-1 accuracy. In [16], Choi et al. have proposed a universal DNN weight compression framework with lattice quantization and lossless coding such as bzip2. By leveraging dithering, which adds randomness to the source, the framework proposed in [16] can be universally employed without knowledge of the source distribution. In [17], Young et al. have proposed a transformation-based quantization method. By applying the transformation before the quantization, it achieves a significantly low bit rate of around 1-2 bits per weight element. As explained in [17], the quantization method is orthogonal to our compression technique, meaning that the transform quantization and our technique can be deployed synergistically. Although our technique is geared towards 5-bit quantized weight compression, our arithmetic coding-based compression/decompression can be extended to N-bit weight elements.
The weight size reduction techniques introduced so far have mostly focused on 8-bit, 16-bit, or 32-bit precision weights, meaning that weight size reduction for 5-bit weights has been largely overlooked. On the contrary, our work is based on 5-bit quantized weights, which are more suitable for resource-constrained edge devices. Moreover, for fast in-situ decompression of the weights, we also propose a novel hardware decoder, which has not been introduced yet.

III. BACKGROUND
A. 5-BIT WEIGHT QUANTIZATION FOR CNNS
Recently, for deploying CNNs in tightly resource-constrained systems, several studies have focused on reducing the weight element size to below 8-bit integers. One of the representative approaches is 5-bit quantization, which represents each weight element with either a 5-bit log-based or a 5-bit linear-based format. Between log- and linear-based quantization, the log-based representation is preferred. This is because (1) the log-based approach can achieve a better accuracy as compared to the linear-based approach [5], and (2) the log-based representation is more hardware-friendly as it can replace multipliers with shifters [18]. It has been reported that the log-based representation shows very small accuracy drops (e.g., 1.7% and 0.5% top-5 accuracy drops in AlexNet and VGG-16, respectively).

B. ENTROPY-BASED CODING
Entropy-based coding is a very widely used data compression scheme. It adopts variable-length encoding based on the occurrence probability of a certain symbol in a datastream. Based on the probability, it assigns different code lengths for encoding the datastream. There are two widely used entropy-based coding schemes: Huffman coding and arithmetic coding. When encoding data with Huffman coding, one generates a Huffman tree (a type of binary tree) which represents how data is encoded for each symbol based on the occurrence probabilities of the symbols. When encoding the data, one performs a tree traversal from the root node until the symbol is found in the tree. On the other hand, arithmetic coding encodes the data by mapping a stream of symbols into the real number space [0, 1). As introduced in Section II, several weight compression approaches based on Huffman coding have been introduced [4] [12] because of its simplicity, whereas little attention has been paid to arithmetic coding. However, in general, arithmetic coding is known to result in a better compression ratio as compared to Huffman coding [19]. Hence, in our work, we employ arithmetic coding to compress the 5-bit quantized weight data, aiming at a better compression ratio than Huffman coding.
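As a concrete illustration of this gap (our own toy sketch in Python, not from the paper), the following compares the Shannon entropy bound, which arithmetic coding can approach, against the average code length of a Huffman code for a hypothetical skewed distribution over the 32 possible 5-bit weight values; the distribution itself is made up for illustration.

```python
# Toy comparison of the entropy bound vs. Huffman average code length (illustrative only).
import heapq
import math

def entropy_bits_per_symbol(prob):
    """Shannon entropy in bits/symbol: the lower bound that arithmetic coding can approach."""
    return -sum(p * math.log2(p) for p in prob if p > 0)

def huffman_bits_per_symbol(prob):
    """Expected Huffman code length = sum of probabilities of all merged (internal) nodes."""
    heap = [p for p in prob if p > 0]
    heapq.heapify(heap)
    expected_len = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        expected_len += a + b
        heapq.heappush(heap, a + b)
    return expected_len

# Hypothetical skewed distribution over the 32 possible 5-bit weight values.
probs = [0.5] + [0.5 / 31] * 31
print(entropy_bits_per_symbol(probs))   # ~3.48 bits/element (entropy bound)
print(huffman_bits_per_symbol(probs))   # higher: Huffman must use an integer number of bits per symbol
```

For skewed distributions such as quantized CNN weights, the integer-bit constraint of Huffman codes leaves a gap to the entropy bound that arithmetic coding can largely close.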

IV. WEIGHT COMPRESSION WITH HARDWARE-BASED DECODING
A. ARCHITECTURE AND DESIGN OVERVIEW
For a real-world deployment of our weight compression technique, we introduce a system overview and execution flow to support fast and cost-efficient weight encoding/decoding. The overall architecture and execution flow of our proposed technique are illustrated in Figure 1. First, the weight quantization and encoding (upper part of Figure 1) are performed offline. In the cloud (or datacenter), the trained 32-bit FP weight elements can be quantized to a 5-bit format (and can also be pruned), compressed by using our arithmetic coding-based compression technique, and sent to resource-constrained edge devices for CNN inference. The compressed weight data is stored in the memory or storage of the edge device and is accessed when running the CNN inference. We assume there is a CNN accelerator (or NPU) in the edge device because recent edge devices widely adopt CNN accelerators (e.g., the edge tensor processing unit in the Google Coral platform [20]). Please note that the 5-bit quantization can be performed with any already proposed technique (e.g., [4], [21], and [22]), and thus the proposal of a 5-bit quantization method is outside the scope of this paper. A clear description of our contributions and execution flow is depicted in Figure 2.
In the case of CNN inference in the baseline system (i.e., without our technique), the weight data is directly loaded into the CNN accelerator's private local memory (PLM). In this case, non-compressed 5-bit weight data is fully or partially loaded into the CNN accelerator's PLM. However, with our proposed technique, before we send the weight data to the CNN accelerator, we need a fast in-situ decompression (i.e., decoding) of the compressed weight data, which necessitates a hardware decoder. The hardware decoder receives the bitstream of the compressed weights and generates the original 5-bit quantized weight data. In the hardware decoder, there are multiple decoding units (DUs) to expedite the runtime decoding process. Please note that our decoding hardware does not have any dependency on the CNN accelerator, which can perform convolution operations with 5-bit quantized weights.

B. WEIGHT ENCODING AND DECODING ALGORITHMS
1) Algorithm Overview
In this subsection, we describe an overview of our algorithm. Figure 3 summarizes the comparison between the theoretical arithmetic coding-based encoding (a), binary arithmetic coding-based encoding without range scaling (b), and our binary arithmetic coding-based encoding with range scaling (c) and decoding (d). As shown in Figure 3 (a), arithmetic coding encodes the original data into a certain real number between 0 and 1 ([0, 1)). Theoretically, arithmetic coding can compress any raw data (i.e., a sequence of symbols) into one real number since there is an infinite number of real numbers between 0 and 1. In general, however, when we encode certain data into a bitstream, the encoded data should be mapped to a finite binary number space, which can cause underflow. In this case, the encoded data may be lossy because different original data can be mapped to the same encoded binary due to the limited space for data encoding. As shown in Figure 3 (b), if we have a long sequence of weight elements and we do not have a sufficiently large binary mapping space, underflow can occur. This is because the feasible mapping space becomes smaller and smaller as more weight elements are encoded. In our proposed technique, we also map the weight elements into an unsigned integer-based binary mapping space. Since our compression (i.e., encoding) technique aims to generate losslessly encoded data, we also employ a range scaling method that can adaptively scale the range of the binary mapping space. In the case of encoding (Figure 3 (c)), depending on the mapped sub-range for a certain element, we adaptively scale this sub-range according to the range scaling condition (we explain the details of the scaling condition in the next subsection). In this case, we record the scaling information as well as the information on the mapped and scaled sub-range to the compressed bitstream. In the case of decoding (Figure 3 (d)), by referring to a sliding window (gray shaded area in Figure 3 (d)) within the bitstream, we find which sub-range the unsigned integer (the number Z converted from the binary in the sliding window) belongs to. By referring to the found sub-range, we decode the weight element to which the sub-range corresponds. By shifting the sliding window, we decode the next weight element in a similar way to that described above.

2) Weight Encoding Algorithm
In this subsection, we explain how we compress (i.e., encode) the 5-bit quantized weight data in detail. Figure 4 shows a pseudocode of our arithmetic coding-based weight encoding with range scaling. As inputs, the 5-bit quantized weight data (W), the occurrence probabilities for each weight value (PROB), and the number of weight elements (K) (i.e., the number of parameters) are required. To obtain PROB, we can calculate the probability based on how many times each weight value appears in the weight data. The output is an encoded weight bitstream (BS). Prior to encoding, we need to set several variables: N is the number of bits for mapping the weight data to an unsigned binary with arithmetic coding. RS contains the range scaling information, which is initially set to 0. MAX, HALF, and QTR are calculated according to N. For initialization, low and high are first set to 0 and MAX, respectively. We also need to collect the cumulative probabilities for each weight value.

(Figure 3 caption: For illustrative purposes, we assume that the unsigned integer binary is 8 bits wide (thus, the mapping space is 0-255). In this example, we encode or decode a sequence of symbols (weight elements) "A", "B", "C", and "D". In (d), the numbers specified for the bitstream and Z values are merely exemplary values.)
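As a small side illustration (our own sketch, not the pseudocode in Figure 4), PROB and the cumulative probabilities used later by the encoder and decoder could be collected from the 5-bit weight data as follows; the function name and the use of NumPy are our own choices.

```python
# Sketch: deriving PROB and the cumulative probabilities from 5-bit quantized weights.
import numpy as np

def weight_statistics(weights_5bit):
    """weights_5bit: flat array of quantized weight values in 0..31."""
    counts = np.bincount(np.asarray(weights_5bit, dtype=np.int64), minlength=32)
    prob = counts / counts.sum()                          # PROB: occurrence probability of each value
    cum_prob = np.concatenate(([0.0], np.cumsum(prob)))   # cum_prob[0] = 0, cum_prob[32] = 1
    return prob, cum_prob
```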

(Figure 4 caption: Pseudocode for encoding (compression).)
The encoding procedure is performed on a per-weight-element basis. For each weight element (from 0 to K−1), we set the high and low values to a range for mapping a certain weight element to an unsigned integer binary number space (lines 2-4 in Figure 4). We call this range ([low, high]) the sub-range. When mapping the elements in the sub-range with range scaling, there can be three cases of range scaling: (a) upper scaling, (b) lower scaling, and (c) middle scaling. These three cases are also shown in Figure 5. Lines 5-7 and 10 in Figure 4 correspond to case (a) in Figure 5 (upper scaling): we write 1₂ followed by RS bits of 0₂s to the bitstream (for example, if RS=3, we write (1000)₂ to BS), reset RS to 0, and update the low and high values by following the upper scaling rule for further range scaling. The reason why we write 1₂ in the upper scaling case is that the sub-range is in the upper half of the binary mapping space, meaning that the most significant bit (MSB) of the mapped unsigned integer binary value will be 1₂. The RS bits of 0₂s incorporate the information on how many middle scalings have been done before. Performing more middle scalings implies that the original sub-range (i.e., before performing the upper and middle scalings) was closer to the center point of the range (Figure 6 (a)). This is why we write more 0₂s as we perform more middle scalings in the previous loop iterations. In case (b) in Figure 5 (lower scaling: lines 5 and 8-10 in Figure 4), we write 0₂ (the MSB of the mapped unsigned integer binary value will be 0₂) followed by RS bits of 1₂s to the bitstream, reset RS to 0, and update the low and high values by following the lower scaling rule for further range scaling. As depicted in Figure 6 (b), the reason why we write 0₂ and RS bits of 1₂s can be explained in a similar way to the upper scaling case. In case (c) in Figure 5 (middle scaling: lines 11-12 in Figure 4), we do not encode the element data and only scale the sub-range by following the middle scaling rule as long as the condition of the middle scaling case (line 11 in Figure 4) is met, while we also keep incrementing RS to track how many middle scalings have been done.
There can be a case where the sub-range does not belong to any of the three scaling cases. In this case, we do not scale the sub-range and we move on to the next iteration (i.e., the next iteration of the for loop, or terminating the for loop in the case of the last iteration). This procedure (lines 1-13 in Figure 4) is repeatedly performed until all elements of W are encoded to BS. After that, lines 14-18 in Figure 4 perform a bitstream writing that corresponds to the last part of the weight data which has not been encoded yet. In this part, we do not have a middle scaling case and only record the bits by following either the upper or lower scaling rule. Although our technique maps weight elements to an N-bit unsigned integer binary number space with arithmetic coding, the encoded bitstream also contains the range scaling information (RS) along with iteratively appended binary bits depending on the scaling condition (i.e., lower and upper scaling: lines 6-9 in Figure 4). Thus, the encoded bitstream is typically much longer than N bits.
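To make the procedure concrete, below is a minimal Python sketch of such an encoder (ours, not the authors' code from Figure 4). It follows the standard integer arithmetic-coding formulation with the three range-scaling cases described above; details such as off-by-one conventions in the sub-range update may differ from the paper's pseudocode, and cum_prob follows the convention of the helper sketched earlier (cum_prob[w]..cum_prob[w+1] is the probability interval of weight value w).

```python
# Minimal arithmetic-coding encoder sketch with upper/lower/middle range scaling.
def encode(weights, cum_prob, n_bits=32):
    """weights: list of 5-bit values (0..31); cum_prob: 33 cumulative probabilities."""
    MAX = (1 << n_bits) - 1
    HALF, QTR = (MAX + 1) // 2, (MAX + 1) // 4
    low, high, rs, bs = 0, MAX, 0, []

    def emit(bit, rs_bit, count):
        bs.append(bit)
        bs.extend([rs_bit] * count)

    for w in weights:
        span = high - low + 1
        high = low + int(span * cum_prob[w + 1]) - 1        # new sub-range for w
        low = low + int(span * cum_prob[w])
        while True:
            if high < HALF:                                  # (b) lower scaling: MSB is 0
                emit(0, 1, rs); rs = 0
                low, high = 2 * low, 2 * high + 1
            elif low >= HALF:                                # (a) upper scaling: MSB is 1
                emit(1, 0, rs); rs = 0
                low, high = 2 * (low - HALF), 2 * (high - HALF) + 1
            elif low >= QTR and high < HALF + QTR:           # (c) middle scaling
                rs += 1
                low, high = 2 * (low - QTR), 2 * (high - QTR) + 1
            else:
                break
    rs += 1                                                  # flush the last sub-range
    if low < QTR:
        emit(0, 1, rs)
    else:
        emit(1, 0, rs)
    return bs
```

The emit calls implement exactly the bit-writing rules above: 0₂ followed by RS 1₂s for lower scaling, and 1₂ followed by RS 0₂s for upper scaling.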

3) Weight Decoding Algorithm
The overall structure of the decoding (i.e., decompression) procedure is very similar to that of the encoding procedure. The decoding procedure exactly follows the range calculation and scaling while we perform the weight element mapping with a part of the bitstream, which is an inverse operation of the encoding procedure. Figure 7 presents a pseudocode for the decoding. We need the encoded weight bitstream (BS), the occurrence probabilities for each weight value (PROB), and the number of original weight elements (K) as inputs, while the output of the decoding procedure is the original 5-bit weight elements (W). We omit the explanation of the variable initialization because the variable setting is the same as in the encoding, except for Z(idx), which corresponds to the N bits starting from the bit index idx in the BS, where N is the number of bits used for mapping the encoded data to an unsigned binary number space (the same as in the encoding). For example, if N=8, Z(idx) will be the 8 bits starting from the bit index idx, which can also be represented by an unsigned integer number ranging from 0 to 255 (= 2^8 − 1). Please note that Z(idx) shifts from the starting point of the encoded bitstream in a sliding window manner, as explained in Section IV-B1. The decoding procedure is performed until the encoded bitstream (BS) is fully decoded into the original 5-bit weight elements (W). We decode the bitstream in units of N bits (Z(idx)). Here, the bit index of the bitstream begins with 0 (idx=0); thus, the initial N-bit window will be Z(0). Before we perform the iterations, we need to append N−1 bits of 0₂s at the end of BS in order to enable the decoding of the last N−1 bits in the BS (line 1 in Figure 7). As in the encoding procedure, decoding is also performed on a per-weight-element basis (from 0 to K−1: line 2 in Figure 7). In lines 3-8, we calculate the sub-range ([low, high]) for each weight value (i.e., from 0 to 31) and find which weight value's sub-range contains Z(idx). If the sub-range of weight value j contains Z(idx), we write j to the decoded weight W[i]. Similar to the case of the encoding, we then consider the three different range scaling cases: upper, lower, and middle scaling. In the cases of upper and lower scaling (lines 10-13 in Figure 7), we also scale the sub-range ([low, high]) by following the corresponding scaling rule, as in Figure 5 (a) and (b), respectively, while also updating Z(idx). Please note that Z(idx) can be regarded as a pointer that indicates the N bits in the BS with a starting index of idx. Thus, updating Z(idx) also means updating the N bits in the BS starting at the bit index idx, which also affects the following Z(idx+1) value that will be referenced next (i.e., referenced by shifting the sliding window). In the middle scaling case (lines 14-16 in Figure 7), we scale the sub-range as in Figure 5 (c) while also updating Z(idx). As in the encoding procedure, for the three scaling cases, the scaling procedure is repeatedly performed (in a while loop) as long as the scaling condition is satisfied. Once all K weight elements are decoded, the for loop (lines 2-17 in Figure 7) terminates and the decoding process is completed.
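A matching decoder sketch is shown below (again ours, not the pseudocode in Figure 7). Instead of re-reading an N-bit sliding window Z(idx) from BS, it keeps an N-bit register z and shifts in one bit per scaling step, which is an equivalent and common implementation choice; the cum_prob convention is the same as in the encoder sketch.

```python
# Minimal arithmetic-coding decoder sketch matching the encoder above.
def decode(bs, cum_prob, k, n_bits=32):
    MAX = (1 << n_bits) - 1
    HALF, QTR = (MAX + 1) // 2, (MAX + 1) // 4
    bits = list(bs) + [0] * (n_bits - 1)            # append N-1 zero bits (line 1 of Figure 7)
    z = int("".join(map(str, bits[:n_bits])), 2)    # initial N-bit window Z(0)
    pos = n_bits
    low, high, out = 0, MAX, []

    for _ in range(k):
        span = high - low + 1
        # Find the weight value whose sub-range contains z (lines 3-8 of Figure 7).
        for w in range(32):
            lo_w = low + int(span * cum_prob[w])
            hi_w = low + int(span * cum_prob[w + 1]) - 1
            if lo_w <= z <= hi_w:
                out.append(w)
                low, high = lo_w, hi_w
                break
        while True:                                  # same three scaling cases as in encoding
            if high < HALF:
                pass                                 # lower scaling: nothing to subtract
            elif low >= HALF:
                low, high, z = low - HALF, high - HALF, z - HALF
            elif low >= QTR and high < HALF + QTR:
                low, high, z = low - QTR, high - QTR, z - QTR
            else:
                break
            low, high = 2 * low, 2 * high + 1
            z = 2 * z + (bits[pos] if pos < len(bits) else 0)  # shift the window by one bit
            pos += 1
    return out
```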
4) Encoding and Decoding Examples
Figure 8 shows an illustrative example of our arithmetic coding-based weight encoding. For brevity, we limit the types of weight values (i.e., the possible weight element values) to "0", "1", and "2" (a 5-bit weight element can originally take 32 different values). At first, we encode the first weight element "0". By calculating the sub-range ([low, high]) of the weight element "0", the new sub-range will be [0, 102] (=[0+255×0.0, 0+255×0.4]), which corresponds to the case of lower scaling (high < HALF). In this case, we write a bit 0₂ to the output bitstream (RS is currently 0, thus we do not write any 1₂s). We then scale the lower and upper bounds of the sub-range by 2× by following the lower scaling rule. The new scaled range will be [0, 204], which does not account for any of the three scaling cases. Thus, we move to the next iteration for encoding the second weight element "1". The next sub-range will be [81, 163] (=[0+204×0.4, 0+204×0.8]), which corresponds to the case of middle scaling. In this case, we increment RS by 1 and scale the range by following the middle scaling rule, which makes the current range [34, 198]. After encoding the remaining weight elements in the same manner, the sub-range becomes [12, 96]. This accounts for the lower scaling case, resulting in writing 0₂ to the BS. By following the lower scaling rule, we obtain a new scaled range of [12×2, 96×2] = [24, 192], which does not correspond to any of the three cases, terminating the main loop (lines 1-13 in Figure 4). Since we have already encoded all the weight elements, we need to process the last part of the encoding (lines 14-18 in Figure 4). After we increment RS by 1, we write (01)₂ to the BS because the lower bound of the current range (low) is 24, which is less than QTR (64). Finally, we obtain the encoded bitstream of (001101001)₂.

Figure 9 demonstrates an example of the weight decoding. Please note that the N, MAX, HALF, and QTR values are the same as in the encoding example. Before starting the first iteration, we append 7 bits of 0₂s at the end of BS. In this example, we perform five iterations because K, the number of weight elements encoded in the BS, is 5.

5) Hardware Decoder Architecture and Implementation
Our hardware decoder (Figure 10) consists of a bitstream buffer, a probability table, a range scaling unit, a range calculation unit, a comparator, a decoded weight buffer, and control logic. The probability table stores the cumulative probability values for each weight value (the first and the last cumulative probabilities are always 0 and 1, respectively, meaning that they do not need to be stored in the table). The decoded weight buffer stores the decoded weight elements. From the bitstream buffer, we send Z(idx) (where idx is the starting bit index of the N bits) to the range scaling unit, which performs the range scaling according to the three cases (upper, lower, and middle scaling). After that, the scaled ranges from the range scaling unit and the probability values are sent to the range calculation unit, which calculates a new sub-range for each weight value. In our design, we perform the 32 sub-range calculations (corresponding to lines 4-6 in Figure 7) in parallel, improving the performance of our decoder. In the comparator, the Z(idx) value and each sub-range are also compared in parallel in order to figure out which weight value should be written to the output in the current iteration. Once the comparison is done, we store the corresponding element to the decoded weight buffer (lines 7-8 in Figure 7). This process is iteratively performed according to the orchestration of the control logic. For a proof-of-concept of our decoding hardware, we have implemented it on a field-programmable gate array (FPGA) board (Xilinx ZCU106) using the Xilinx Vivado design suite. We have synthesized our design for 150MHz with 16 DUs (the maximum number of DUs which our FPGA chip can accommodate), which results in 16× higher throughput than the 1-DU decoder implementation. To utilize the 16 DUs, we divide the weight elements into 16 chunks as evenly as possible and encode each chunk into a separate bitstream. When decoding the bitstreams, we send each bitstream to one of the 16 DUs.
Although our prototype is implemented with 16 DUs, the number of DUs is one of the design parameters which can be flexibly determined by the designer considering the available hardware resources (we demonstrate the latency versus resource usage trade-off in Section V-C). If we implement our decoding hardware as an application-specific integrated circuit (ASIC), we could implement more DUs, resulting in a better decoding throughput. According to our implementation, our single-DU hardware takes 6.45 clock cycles per decoded weight element, on average, though the total number of decoding cycles depends on the pattern and sequence of the weight elements in the bitstream.
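As a rough back-of-the-envelope sketch (our own estimate, based only on the figures quoted above: ~6.45 cycles per decoded element per DU at 150MHz, with the weights split as evenly as possible across the DUs), the decoding latency could be approximated as follows; the example weight count is hypothetical.

```python
# Rough decode-latency estimate for a multi-DU decoder (assumption-based sketch).
import math

def decode_latency_ms(num_elements, num_du=16, cycles_per_element=6.45, freq_hz=150e6):
    elements_per_du = math.ceil(num_elements / num_du)   # the largest chunk dominates
    cycles = elements_per_du * cycles_per_element
    return cycles / freq_hz * 1e3

# Hypothetical example: ~2.3M CONV-layer weight elements.
for du in (2, 4, 8, 16):
    print(du, round(decode_latency_ms(2_300_000, num_du=du), 2), "ms")
```

Halving the number of DUs roughly doubles this estimate, which matches the qualitative trade-off discussed in Section V-C.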

C. HIDING THE DECODING LATENCY
When executing the CNN inference, the weight decoding must be performed prior to the convolution operations. Without hiding the decoding latency, the latency overhead would be non-negligible, which is not desirable. To minimize the decoding latency, we can overlap the transfer and decoding latency of the weights of the i-th CNN layer with the execution latency of the (i−1)-th CNN layer in the CNN accelerator (illustrated in Figure 11). In this case, the latency overhead will be only the decoding latency of the weights for the first CNN layer (where the weight decoding latency cannot be overlapped), as long as the following two conditions are satisfied. First, the input and output feature maps are reused across the CNN layers in the CNN accelerator (hence, the input and output feature maps do not need to be transferred between the memory and the PLM of the accelerator). Second, the data transfer and decoding latency of CNN layer i are fully hidden by the execution time of CNN layer i−1, where integer i > 1.
To satisfy the second condition, more DUs in the decoding hardware can be desirable. This is because the decoding latency can be further reduced as we have more DUs in the decoder.
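The overall effect of this pipelining can be captured with a simple latency model (our own sketch, not the paper's evaluation methodology): only the first layer's weight decoding/transfer, plus any per-layer remainder not covered by the previous layer's execution time, adds to the end-to-end inference latency.

```python
# Simple latency model for weight decode/transfer pipelined with layer execution (sketch).
def inference_latency(exec_t, decode_t, transfer_t):
    """exec_t[i], decode_t[i], transfer_t[i]: per-layer execution, weight decoding,
    and weight transfer times (seconds) for layers 0..L-1."""
    total = decode_t[0] + transfer_t[0]                    # layer 0 cannot be hidden
    for i in range(len(exec_t)):
        total += exec_t[i]
        if i + 1 < len(exec_t):
            # Portion of layer i+1's decode/transfer that layer i's execution cannot hide.
            total += max(0.0, decode_t[i + 1] + transfer_t[i + 1] - exec_t[i])
    return total
```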

V. EVALUATION RESULTS
We show our evaluation results in terms of three metrics: compression ratio, energy consumption in the main memory, and latency overhead. For benchmarks, we use five CNN models: Network-in-Network (NiN) [23], SqueezeNet [24], GoogLeNet [25], AlexNet [26], and CaffeNet [27]. We use 32 for N in our evaluations.
We have used the trained 32-bit floating-point (FP32) weights provided by the Caffe framework [27]. For 5-bit quantization, we have employed the incremental network quantization (INQ) [6] method to generate 5-bit power-of-two (i.e., utilizing the binary logarithm, or logarithm to the base 2) quantized weights from the 32-bit full-precision weights. The main reason we choose INQ for our baseline is that it shows comparable or even better accuracy with 5-bit quantized weights as compared to 32-bit full-precision weights when running CNN inferences. Though we use a specific method (INQ) for 5-bit power-of-two quantization, our technique can be deployed with any 5-bit quantization method. For AlexNet and CaffeNet, we have additionally employed weight pruning (we used [28] and [29] for weight pruning of AlexNet and CaffeNet, respectively) to figure out how weight pruning affects the compression ratio. We consider two cases of weight data: (1) 5-bit quantization only (denoted as Q) and (2) pruning and 5-bit quantization (denoted as P+Q). Firstly, we present the compression ratio (Section V-A), memory energy consumption (Section V-A), and latency overheads for CNN inference (Section V-B). We only compare the compression ratio and memory energy consumption for CNN convolutional (CONV) layers while excluding the fully connected (FC) layers. For the latency overheads, we assume that we only compress the CONV layers' weights while the weights of the other layers, such as FC layers, are not compressed. In this case, the layers other than the CONV layers do not incur latency overheads because their weights do not need to be decompressed. Although early CNN models have a large number of weights in FC layers (e.g., AlexNet [26]), recent CNN models have an FC layer only as the last layer of the model (e.g., ResNet [1]), meaning that the CNN layer architecture is mostly composed of CONV layers. For the deployment of our proposed technique in highly resource-constrained systems, we also demonstrate a trade-off between the CNN inference latency and hardware resource usage (Section V-C). In addition, we further show the system-level energy consumption of a CNN inference in highly resource-constrained systems (Section V-D).

A. COMPRESSION RATIO AND MEMORY ENERGY CONSUMPTION
The compression ratio (CR) is defined as the ratio of the uncompressed data size S_u to the compressed data size S_c, that is, CR = S_u / S_c. Table 1 summarizes the compression ratio across the five CNN models. In Table 1, 32-bit, 16-bit, 8-bit, and 5-bit correspond to the cases of uncompressed 32-bit FP, 16-bit quantized, 8-bit quantized, and 5-bit quantized weights, respectively; HC-5bit and AC-5bit denote Huffman coding-based and our arithmetic coding-based compression of the 5-bit quantized weights, respectively, and IE-5bit denotes the ideal compression derived from the information entropy of the 5-bit weight data.
In the case of quantization only (Q), our AC-5bit reduces the weight data size by 9.6× as compared to the uncompressed 32-bit weight data. In addition, our technique reduces the weight data size by 29.8%-34.9% as compared to the uncompressed 5-bit quantized weight data, meaning that our proposed technique can significantly reduce the required storage and memory. In addition, our arithmetic coding-based compression technique shows a better compression ratio as compared to the Huffman coding-based compression technique. On average, when compressing the 5-bit quantized weight data, our technique results in a 3.7% (up to 7.2%) better compression ratio as compared to the Huffman coding-based technique (HC-5bit). Moreover, our arithmetic coding-based compression technique shows near-optimal compression ratios: as compared to IE-5bit, our AC-5bit shows a compression ratio that is only 0.1% lower, on average.
In the case of pruning and quantization (P+Q), our AC-5bit obtains 57.5× and 112.2× higher compression ratios over 32-bit FP for AlexNet and CaffeNet, respectively. The pruning significantly increases the number of zero-valued elements in the weights, which also significantly contributes to the compression ratio. HC-5bit obtains 26.7× and 28.8× better compression ratios (as compared to 32-bit FP) for AlexNet and CaffeNet, respectively; hence, our AC-5bit results in 2.2× and 3.9× better compression ratios for AlexNet and CaffeNet, respectively, as compared to HC-5bit. As shown in the results (Table 1), our technique shows even more promising results when applying weight pruning, which is a widely used technique for CNN inferences.
We also compare our work with other state-of-the-art techniques in terms of the compression ratio when using AlexNet [26]. Though the comparison in Table 1 is based on the compression of the weights in the CONV layers, for a fair comparison, we compare our technique to the other techniques based on the compression of the weights in all layers, including the fully connected (FC) layers. Before applying our compression technique, all CONV and FC layers are pruned by [28] and quantized to a 5-bit format by [6]. As shown in Table 2, our technique shows a better compression ratio as compared to the techniques or methods presented in [4], [12], and [14]. As compared to the method presented in [16], however, our technique shows a 24.7% lower compression ratio. In [16], bzip2 compression is additionally applied to the pruned and quantized weights. In this case, there would be a non-negligible latency overhead for decompressing the compressed weights during runtime CNN inference. Please note that without the bzip2 compression in [16], the compression ratio of our technique is better than that of [16] by 14.2%. Since our work also proposes a hardware decoder and pipelining which minimize the latency overhead of runtime weight decompression, our technique would be more practical and suitable for real-world deployment.
A higher weight compression ratio translates into less memory energy consumption when transferring the weight data to the PLM of the CNN accelerator or the NPU. Table 3 presents memory data transfer energy comparison results when the device uses LPDDR4 (5pJ per bit [31] and 0.43W static power [32]) dynamic random access memory. We have estimated the dynamic energy by multiplying the accessed data size (in bits) by the per-bit access energy, and the static energy by multiplying the static power by the memory transfer time. The total estimated energy consumption is the summation of the dynamic and static energy. When only employing the 5-bit quantization (Q), our AC-5bit reduces the memory data transfer energy by 89.2% as compared to the case of using uncompressed 32-bit FP weight data. Our AC-5bit also leads to 36.4% and 3.4% less memory energy consumption for weight transfer as compared to the cases of uncompressed 5-bit weights and HC-5bit, respectively. When employing both 5-bit quantization and pruning (P+Q), our AC-5bit results in 98.3% and 99.1% less memory energy consumption for AlexNet and CaffeNet, respectively, as compared to the case of 32-bit FP weights. Even compared to HC-5bit, our AC-5bit reduces the memory energy consumption by 53.6% and 74.3% for AlexNet and CaffeNet, respectively. Considering that the energy consumption of memory-related parts accounts for a large portion of the total in resource-constrained edge systems [33], our proposed technique will enable more energy-efficient resource-constrained edge devices.
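The per-transfer energy estimate described above can be sketched as follows (our own sketch; the 5pJ-per-bit dynamic energy and 0.43W static power come from the text, while the effective memory bandwidth is an assumed value used only for illustration).

```python
# Memory data-transfer energy estimate for weight loading (assumption-based sketch).
def memory_transfer_energy_mj(size_bits, bandwidth_bits_per_s=25.6e9,
                              pj_per_bit=5.0, static_power_w=0.43):
    dynamic_j = size_bits * pj_per_bit * 1e-12           # accessed bits x per-bit access energy
    transfer_time_s = size_bits / bandwidth_bits_per_s   # assumed effective bandwidth
    static_j = static_power_w * transfer_time_s          # static power x transfer time
    return (dynamic_j + static_j) * 1e3
```

With such a model, any reduction in the transferred bit count lowers both the dynamic and the static term, which is why the energy savings in Table 3 closely track the compression ratios.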

B. LATENCY OVERHEAD
Our technique incurs a decoding latency before the weight data is loaded into the accelerator's on-chip memory. As explained in Section IV-C, the decoding latency can be hidden by overlapping the weight decoding and transfer with the CNN layer execution. We evaluate the latency overhead when our proposed decoding hardware (Figure 10) is combined with the FPGA-based CNN accelerators listed in Table 4, because our decoding hardware prototype is implemented and verified on an FPGA. There is also a CNN accelerator [11] with 8-bit precision in Table 4, while our main target for weight compression is 5-bit quantized weights. Though we have presented our technique for 5-bit weights, our arithmetic coding-based encoding and decoding technique can also be used along with 8-bit precision-based CNN accelerators. In this case, we only compress 5 bits within each 8-bit weight element by using our arithmetic coding, while the remaining bits (i.e., 3 bits) of the weights remain uncompressed (the remaining 3 bits can be transferred directly from the memory to the CNN accelerator without passing through the hardware decoder). In this case, we will have a lower compression ratio and a higher weight decoding and transfer latency as compared to the case of 5-bit weight compression, due to the uncompressed part in each weight element. Nonetheless, to present the versatility of our technique, we also present the latency overhead results when adopting our technique with the 8-bit precision accelerator [11] in Table 4.
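One possible way to realize the 5-bit/3-bit split described above is sketched below (our own illustration; the paper does not specify which 5 bits are compressed, so taking the upper 5 bits is an assumption).

```python
# Sketch: splitting 8-bit weights into a 5-bit portion (arithmetic-coded) and a raw 3-bit portion.
import numpy as np

def split_8bit_weights(weights_8bit):
    w = np.asarray(weights_8bit, dtype=np.uint8)
    upper_5bits = w >> 3          # routed through the arithmetic encoder/decoder
    lower_3bits = w & 0x07        # stored and transferred uncompressed, bypassing the decoder
    return upper_5bits, lower_3bits

def merge_8bit_weights(upper_5bits, lower_3bits):
    return (np.asarray(upper_5bits, dtype=np.uint8) << 3) | np.asarray(lower_3bits, dtype=np.uint8)
```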
When using our proposed decoding hardware and the various CNN accelerators without the latency hiding, the latency overhead is 13.26%-42.94%. On the other hand, with the latency hiding, our proposed technique without pruning shows 0.16%-5.48% latency overheads when performing the CNN inferences, implying that the latency overhead from the decoding hardware is small. In the case of our proposed technique with pruning and latency hiding, the latency overheads are almost negligible (0.16%-0.91%). When focusing on the case with the 8-bit precision accelerator [11] in Table 4, the latency overhead is only up to 1.40%, implying that our technique can also be deployed with an 8-bit precision accelerator with a negligible latency overhead.
The large latency overhead reduction when applying the pruning is attributed to the reduced weight data size with arithmetic coding (due to the increase in the number of zero-valued weight elements), resulting in quicker decoding and a shorter weight transfer latency. Considering that the main focus of our technique is resource-constrained edge devices, this small latency overhead is acceptable, as the benefits from the reduced memory and storage requirements and the reduced memory energy consumption are much greater than the latency overhead.

C. LATENCY VERSUS RESOURCE USAGE TRADE-OFF
For systems or devices under tight resource constraints, we also present the latency versus resource usage trade-off when employing 2-DU, 4-DU, 8-DU, and 16-DU decoders. The 2-DU, 4-DU, and 8-DU designs require far fewer hardware resources than the 16-DU design. Thus, the 2-DU, 4-DU, and 8-DU designs can be suitable for small or tiny embedded edge devices. However, a smaller number of DUs leads to a higher decoding latency overhead, which also results in an increased CNN inference latency. As shown in Figure 12, in the case of the 4-DU and 8-DU decoders, the performance overheads without pruning can be up to 34.2% and 8.73%, respectively, whereas the performance overheads with pruning can be up to 31.4% and 2.77%, respectively, even with the latency hiding. With the 2-DU decoder, which can be deployed for systems with extremely stringent resource constraints, the latency overhead can be up to 126.1% without pruning and 108.0% with pruning. The reason why the decoding time overhead seems large when used with the CNN accelerator in [9] is that the baseline CNN inference latency of [9] is very small, which makes the latency overhead from our decoder relatively large. For the decoding overhead with the CNN accelerator in [11], even though the baseline inference latency of [11] is higher than that of [10], the latency overhead is larger as compared to the case of [10]. This is because the CNN accelerator in [11] uses 8-bit precision, which implies that the compression ratio will be worse in [11] since our encoding is optimized for 5-bit weights (i.e., 5 bits within each 8-bit element are compressed while the remaining 3 bits remain uncompressed). This results in a relatively high transfer latency when using the CNN accelerator in [11] with our decoder.
In the cases of 8-DU and 16-DU with the CNN accelerator in [11], the transfer and decoding latency can be mostly hidden by the CNN layer processing time. However, in the cases of 2-DU and 4-DU with the CNN accelerator in [11], the transfer and decoding latency cannot be hidden by the CNN layer processing time, leading to the large latency overhead. For [9], though the relative decoding time overhead can be large, the absolute inference latency is negligibly affected (+9.54ms and +2.77ms with the 2-DU and 4-DU decoders, respectively), as shown in Table 5. For [11], the decoding time overheads with the 2-DU and 4-DU decoders could be decreased if a 5-bit precision CNN accelerator were used.
In typical edge devices, the baseline CNN inference latency will not be very small. This is because the CNN accelerator performance will be limited due to the tight hardware resource constraints. In addition, satisfying the deadline of the response time (i.e., latency) is more important in edge or embedded systems, which implies that the increased latency overhead is acceptable as long as it does not violate the response time deadline. Thus, the edge system designers can choose the appropriate number of DUs by considering the performance requirements and resource constraints of the system under design.

D. SYSTEM-LEVEL ENERGY ESTIMATION
We compare the system-level energy when using our arithmetic coding-based compression for the 5-bit quantized and pruned weights with a 4-DU decoder against the baseline (i.e., without our compression and decoder, while 5-bit quantization is still employed since the combined CNN accelerators support 5-bit precision arithmetic operations). The system-level energy includes the CNN accelerator energy (with the 4-DU decoder in the case with our proposed technique), the DRAM-based main memory energy, and the NVMe (Non-Volatile Memory Express) flash-based storage energy. Since the non-volatile flash storage is accessed to load the weights into the main memory before CNN inferences, we have included the flash-based storage energy in our system-level energy estimation. Please note that we use the flash energy parameter reported in [34] (1J / 28MB = 4.26nJ per bit). Since the CNN accelerator power is reported in [9] and [10] but not in [11], we only include the results with [9] and [10] in our system-level energy estimation. The reason why we choose the 4-DU decoder among the various configurations is that the 4-DU decoder can be accommodated in an edge/embedded platform (such as the Ultra96) and shows a good trade-off between the inference latency and resource usage. As shown in Table 6, for the combined accelerators with our proposed technique, the power consumption and inference latency are increased, which results in an increased energy consumption in the FPGA by 40.2% and 5.7% with the CNN accelerators in [9] and [10], respectively. However, when considering the system-level energy consumption, which includes the DRAM memory and flash-based storage energy consumption, our proposed technique with the 4-DU configuration results in a system-level energy reduction of 9.3% and 1.1% with the CNN accelerators in [9] and [10], respectively. This is attributed to the reduced weight data size from our arithmetic coding-based weight compression.
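The system-level estimate described above can be summarized by a simple sketch (ours; the 4.26nJ-per-bit flash read energy is the value from [34], while the breakdown into the three terms follows the text).

```python
# System-level energy estimate: accelerator (+decoder), DRAM, and flash storage (sketch).
def system_energy_j(accel_power_w, inference_latency_s, dram_energy_j,
                    weight_bits_read_from_flash, flash_nj_per_bit=4.26):
    accel_j = accel_power_w * inference_latency_s        # accelerator (and decoder) energy
    flash_j = weight_bits_read_from_flash * flash_nj_per_bit * 1e-9
    return accel_j + dram_energy_j + flash_j
```

In this breakdown, the compressed weights shrink both the DRAM and the flash terms, which is what allows the system-level energy to drop even when the FPGA-side energy itself increases.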

VI. CONCLUSIONS
In resource-constrained edge devices, one of the most serious challenges for deploying on-device CNN inferences is the huge weight data size, which can hardly be fully stored in an edge device. In this paper, we have proposed an arithmetic coding-based weight compression technique with range scaling for lossless compression of 5-bit quantized weights. In addition, we have also proposed decoding hardware for fast, yet efficient runtime weight decoding (decompression). Our evaluation results reveal that applying our weight compression technique to 5-bit quantized weights (not pruned) achieves a 9.6× better compression ratio as compared to the uncompressed 32-bit FP weights. When applying our technique to the pruned 5-bit quantized weights, our technique results in 57.5×-112.2× better compression ratios as compared to the uncompressed 32-bit FP weights. Due to the reduced weight data size, our technique also leads to a memory data transfer energy reduction of 89.2% (of up to 99.1% for pruned weights), on average, as compared to the uncompressed 32-bit FP weight data. When combining our decoding hardware with various state-of-the-art CNN accelerators, the latency overheads of our proposed technique with the 16-DU decoder along with the latency hiding are only 0.16%-5.48% and 0.16%-0.91% for non-pruned and pruned weights, respectively. In addition, our proposed technique with 4-DU decoder hardware reduces the system-level energy consumption by 1.1%-9.3% as compared to the case without our proposed technique. As future work, we plan to generalize our arithmetic coding-based compression/decompression technique to weight elements with arbitrary bit-widths.