ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network Design in FPGA-based Systems

The recent advances in machine learning, in general, and Artificial Neural Networks (ANN), in particular, have made smart embedded systems an attractive option for a larger number of application areas. However, the high computational complexity, memory footprints, and energy requirements of machine learning models hinder their deployment on resource-constrained embedded systems. Most state-of-the-art works have addressed this problem by proposing various low bit-width data representation schemes, optimized arithmetic operator implementations, and different complexity reduction techniques such as network pruning. To further elevate the implementation gains offered by these individual techniques, there is a need to cross-examine and combine their unique features. This paper presents ExPAN(N)D, a framework to analyze and combine the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs. The Posit scheme offers a better dynamic range and higher precision for various applications than the IEEE $754$ single-precision floating-point format. However, due to the dynamic nature of the various fields of the Posit scheme, the corresponding arithmetic circuits have higher critical path delay and resource requirements than the single-precision-based arithmetic units. Towards this end, we propose a novel Posit-to-fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs with a minimal drop in the output accuracy. We also propose a modified Posit-based representation to store the trained parameters of a network. Compared to an $8$-bit fixed-point-based inference accelerator, our proposed implementation offers $\approx46\%$ and $\approx18\%$ reductions in the storage requirements of the parameters and in the energy consumption of the MAC units, respectively.


INTRODUCTION
Machine learning algorithms have become an essential factor in various modern applications, such as scene perception and image classification [1]-[3]. Over the past few years, these algorithms have mainly relied on the performance of modern computing systems to support their increasing complexity. For example, massively parallel architectures, such as Graphics Processing Units (GPUs), and cloud-based computing have traditionally been used to train these algorithms. However, to utilize these trained machine learning models on resource-constrained embedded systems, the computational complexity and storage requirements of these algorithms must be reduced.
Many recent works have considered this problem to define various optimization techniques to reduce the complexity of machine learning models, such as Artificial Neural Networks (ANN). For example, the techniques used in [5] and [6] have employed the sparsity of Deep Neural Networks (DNN) to reduce the total number of trained parameters. The works in [7], [8] and [9] have explored other number representation techniques, such as bfloat16, Posit and Fixed Point (FxP), to overcome the storage requirements of single-precision IEEE-754 Floating Point (FP32). Depending on the configuration used, each of these number representation techniques provides a different dynamic range to represent the parameters (weights and biases) of a network. For example, Fig. 1(a) shows the FP32-based distribution of the pre-trained weights of the Conv2_1 layer of the VGG16 DNN [4]. The pre-trained weights have a dynamic range between −0.3 and +0.3, with most of the weights clustered around '0'. To reduce the memory footprint of the weights and the associated computational complexity, Fig. 1(b)
represents the distribution using an 8-bit fixed-point linear quantization scheme, referred to as FxP8. The FxP8 scheme provides a set of 256 uniformly distributed discrete values, which generates an average relative error of 0.295 in the quantized weights. To reduce the quantization-induced errors, Fig. 1(c) shows the trained parameters using an 8-bit Posit scheme. The Posit technique covers more values around 0, resulting in an average relative error of 0.052 in the quantized weights. Therefore, it is imperative to define number representation schemes (or quantization methods) that largely maintain the accuracy of FP32-based machine learning models while reducing their corresponding computational complexity and storage requirements. The various number representation schemes (quantization methods) result in varying performance overheads of their associated arithmetic hardware. For example, Fig. 2 compares the effect of using different quantization methods across multiple performance aspects: behavioral (error in the quantization of weights), computational (critical path delay of a Multiply and ACcumulate (MAC) unit), and memory requirements (weights' storage) for the Conv2_1 layer of the pre-trained VGG16. The hardware implementation results have been obtained by implementing each technique on a Xilinx UltraScale Field Programmable Gate Array (FPGA) using Vivado HLS 2018.2. For a fair comparison, the critical path delay (CPD) is obtained from MAC units implemented using 6-input lookup tables (LUTs) and with a latency of a single cycle. As shown by the results, higher bit-widths for the quantization schemes significantly reduce quantization-induced errors. The FP32 implementation has the highest memory footprint and the worst CPD of 42 ns. The Posit schemes provide better coverage of the FP32-based pre-trained parameters than the corresponding FxP-based schemes. However, the simplicity of FxP-based arithmetic significantly reduces the CPD of the MAC units when compared with the corresponding Posit schemes.
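To make the error metric concrete, the following minimal Python sketch computes the average absolute relative error of an 8-bit linear quantization of a weight tensor. It assumes a symmetric linear quantizer over the observed weight range and synthetic weights; the exact quantizer and data used for Fig. 1 and Fig. 2 may differ.

```python
import numpy as np

def fxp_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Uniform (linear) quantization of w to 2**bits discrete levels,
    symmetric around zero over the observed weight range."""
    step = np.abs(w).max() / (2 ** (bits - 1) - 1)
    levels = np.clip(np.round(w / step), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return levels * step

def avg_relative_error(w: np.ndarray, w_q: np.ndarray, eps: float = 1e-12) -> float:
    """Average absolute relative error of the quantized weights w_q w.r.t. w."""
    return float(np.mean(np.abs(w - w_q) / (np.abs(w) + eps)))

# Synthetic weights clustered around zero, roughly mimicking Conv2_1 of VGG16
w = np.clip(np.random.normal(0.0, 0.05, size=10_000), -0.3, 0.3).astype(np.float32)
print(avg_relative_error(w, fxp_quantize(w, bits=8)))
```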
Most state-of-the-art works do not consider application-specific optimizations to the quantization methods. For instance, the Posit-related works focus on representing the whole range of real numbers, (−∞, ∞), rather than the actual range of the parameters in the application. Similarly, many related works consider each quantization method in isolation and do not attempt to leverage the best features of multiple methods. To this end, we propose the ExPAN(N)D framework for Exploring the joint use of Posit and FxP representations for Designing efficient ANNs. The major contributions of this paper are as follows.

Contributions:
1) We propose a reduced bit-length Posit-based representation that improves the encoding efficiency to reduce the communication and storage costs in ANNs. Using our proposed representation, for each N-bit Posit number we store only N − 1 bits.
2) We propose a novel arithmetic hardware design, referred to as Posit to Fixed Point (PoFx), that aims to combine the best of both the Posit and FxP number representations. The proposed hardware unit offers resource-efficient and low-latency conversion of Posit-based numbers to FxP-based numbers to leverage the lower computation overheads of fixed-point arithmetic. For example, an 8-bit PoFx-based MAC incurs at most a 15% resource overhead while enabling a corresponding 46% reduction in the storage requirements of a network's parameters.
3) Framework for Behavioral Analysis: We provide a high-level framework for the efficient and thorough exploration of various quantization schemes to satisfy the accuracy constraints of a DNN. The proposed framework explores the limitations and the interplay of various quantization schemes, such as FxP to Posit to FxP, to minimize the quantization-induced errors. The framework prunes the non-optimal quantization configurations by analyzing the quantization-induced errors in (a) the parameters of individual layers, (b) the output activations of each layer using quantized weights, and (c) the final output of the network. For example, our framework explores various N-bit Posit configurations to achieve output accuracy comparable to an M-bit FxP-based quantization, where N < M.
4) We explore the impact of using the proposed hardware designs in a fully-connected layer. Specifically, we use an automated design flow, based on state-of-the-art High-Level Synthesis (HLS) tools, to explore storage-computation trade-offs in the design of FPGA-based accelerators for ANNs. For example, compared to an FxP-based accelerator, the PoFx-based accelerator provides up to 46% and 18% reductions in the storage and energy requirements of an accelerator.
The rest of the paper is organized as follows. In Section 2, we provide the relevant background and brief overview of related work. The system model used for the evaluation of the proposed methods is presented in Section 3. In Section 4, we explain the proposed methodology for exploring the use of Posit representation for ANNs, along with the proposed hardware designs. In Section 5, we discuss the results from the experimental evaluation of the different components of the proposed methodology. Finally, we conclude the article in Section 6 with a summary and a discussion on the scope for related future research.

Posit Number System
The IEEE 754-2008 compliant floating-point (floats) arithmetic has become ubiquitous in modern-day computing and is deeply embedded in compilers and low-level software routines. However, floats have several limitations, such as non-identical results across systems, redundant/wasted bit patterns, and a limited dynamic range. The Posit number scheme overcomes these limitations by offering a better dynamic range and portability across various computing platforms. Fig. 3 shows the various fields (sign, regime, exponent and fraction) of the Posit number scheme. A Posit configuration is characterized by its total bit length (N) and the number of bits reserved for the exponent (ES). Utilizing the four fields of the Posit scheme, Eq. 1 defines the computation of a Posit value. The regime field in Fig. 3 is utilized to compute the value of k in Eq. 1. The regime field is terminated when an inverted bit (r) is encountered, and the associated value of k is determined by the number of identical bits (m): if the identical bits are a string of 0s, then k = −m; if they are a string of 1s, then k = m − 1. Next, the exponent (e) and fraction (f) values are determined using the remaining bits. The utilization of the regime field provides a better dynamic range to the Posit number scheme. For example, the authors in [10] have reported that, for some applications, n-bit floats can be replaced by m-bit Posit-based numbers (where m < n) to achieve comparable output accuracy.
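For completeness, the decoding relation that Eq. 1 refers to is the standard Posit value formula, which combines the four fields as

$$ X = (-1)^{s} \times \left(2^{2^{ES}}\right)^{k} \times 2^{e} \times (1 + f), $$

where $f \in [0, 1)$ is the value of the fraction bits interpreted as a binary fraction and $2^{2^{ES}}$ is commonly referred to as useed.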
Compared to the floats and fixed-point number representation schemes, Posit requires more computational resources. In the following section, we summarize the state-of-the-art works related to the hardware implementation of Posit-based arithmetic circuits.

Posit Arithmetic Hardware
The major challenges faced while developing an efficient hardware implementation for Posit arithmetic involve (1) handling the run-time length variation of the individual Posit fields, (2) extraction of the Posit components to facilitate further manipulation, and (3) implementation of the rounding algorithms proposed in the Posit standard. TABLE 1 presents an overview of the state-of-the-art work related to Posit-based arithmetic and highlights our proposed framework's key focus. These works are summarized below.
The authors in [10] tackle the run-time varying field lengths by developing hardware architectures for conversion from Posit to floating point and vice versa. The work in [13] proposes a tool to generate pipelined Posit operators to be used as drop-in replacements in processing units. In [11], the authors present the architecture of a parameterized Posit arithmetic unit to generate Posit adders and multipliers of any bit-width. Similarly, PACoGen [12] employs a three-stage process, involving Posit data extraction, core arithmetic processing and Posit construction, to perform parameterized Posit arithmetic including multiplication and division. It proposes improvements in the Posit data extraction methodology and a pipelined architecture for Posit (N=32, ES=6). Posit arithmetic has also been integrated into Clarinet [21], a RISC-V ISA-based processor that supports the use of a Posit arithmetic core. However, the RISC-V implementations are not capable of handling large-scale applications.

Arithmetic Hardware for ANN Inference
A plethora of recent works have considered different quantization schemes to reduce the memory footprints and computational complexity of DNNs for resource-constrained embedded systems and IoT edge devices. These techniques can be categorized into (a) in-training quantization and (b) post-training quantization schemes. For example, the techniques proposed in [22]-[25] have considered various fixed-point schemes for in-training quantization. The in-training quantization schemes can overcome most of the quantization-induced errors. However, these techniques cannot be utilized for the quantization of the parameters of pre-trained DNNs. For the quantization of pre-trained DNNs, different schemes have been proposed in [9], [26]-[28]. The techniques presented in [27], [28] have focused on the utilization of logarithmic data representations to avoid computationally expensive multiplication operations. However, some recent works, such as [29]-[31], have utilized fixed-point quantization schemes to employ well-explored high-performance and energy-efficient approximate adders and multipliers. The utilization of approximate arithmetic units [32]-[35] provides another degree of freedom for achieving the accuracy, performance, and energy constraints of DNNs for IoT. For example, the authors of [29] have utilized the library of approximate multipliers in [33] to provide approximate accelerators for reduced-precision DNNs.
Some recent works have also explored the utilization of Posit numbers for the training and inference phases of ANNs. For example, the work in [16] has used the ARM Scalable Vector Extension SIMD engine to present vectorized extensions for the cppPosit C++ Posit arithmetic library. The authors of [14] have proposed an exact multiply and accumulate (EMAC) unit for implementing the MAC operations in ANNs. Their results show that the Posit-based representation of a network's parameters performs better than a fixed-point-based representation in retaining the output accuracy of the ANN. However, the Posit-based EMACs have significantly higher resource utilization and energy-delay product (EDP) than the fixed-point-based MAC operations. In [20], the authors have also proposed a parameterized Posit MAC generator to produce the HDL code of a Posit MAC unit. However, they do not present the efficacy of their proposed design in any real-world application. In [18], the authors have also used the EDP metric to compare their proposed Posit-based framework with FP32- and FxP-based implementations; the FxP-based implementations always produce lower EDP. Currently, the Posit numerical scheme's utilization in implementing accelerators for various applications is hampered by the unavailability of resource-optimized and energy-efficient Posit arithmetic units. In our proposed work, we aim to leverage the useful storage capability of Posit, by modifying the Posit number representation to store numbers within the normalized range, and the compute efficiency of FxP-based arithmetic, by implementing a PoFx converter.

Application Model
The hardware designs proposed in our current work can be used for any arbitrary application that needs to communicate and/or store a large number of parameters. However, in this article, we limit our exploration to ANNs. Fig. 4(a) shows one of the most widely used ANNs in research, the VGG16 [4]. As shown in the figure, VGG16 is composed of 16 layers of 4 different types: convolutional, max pooling, fully connected and softmax. Although we use VGG16 as the application for evaluating our proposed methodology, the methods are applicable to any arbitrary ANN, as most networks are composed of a subset of these types of layers. Fig. 4(a) also shows the dimensions of the parameters used in each of the layers. Using accelerators for inference usually involves communicating and storing this large number of trained parameters, 138 million for VGG16. Consequently, the quantization methods used for the parameters can influence the corresponding storage and communication overheads. Similarly, given the large number of MAC operations involved in the inference of a single input, 15.5 billion for VGG16, the speed and power dissipation of the MAC unit determine the throughput and energy consumption of ANN inference. Fig. 4(b) shows the architecture model used in this article. As shown in the figure, we assume an FPGA-based System-on-Chip (SoC) as the hardware platform. It contains an embedded processor along with reconfigurable logic, similar to the Zynq EPP [36]. We assume that the accelerators for the different types of layers of an ANN are executed on the reconfigurable logic and can implement the proposed hardware designs. For any accelerator, we assume that the parameters of the corresponding layer are fetched from the main memory through streaming interfaces with the on-chip AXI interconnect [37]. Similarly, the input and output activations are transferred from and to the main memory using AXI streaming interfaces as well. Hardware platforms based on the Zynq EPP, such as the Ultra96-V2 [38], are being widely marketed as edge processing devices for the Internet of Things (IoT).

DESIGN METHODOLOGY
The top-level view of ExPAN(N)D is shown in Fig. 5. The hardware design and characterization of the MAC units for the various quantization schemes forms the central theme around which the other two methods, Behavioral analysis and Accelerator design, are built. Behavioral analysis enables the estimation of quantization-induced errors in a given ANN using the proposed hardware designs. Similarly, Accelerator design allows the designer to estimate the performance-resource trade-offs resulting from implementing various quantization schemes in an accelerator for a given layer of the ANN. The results from each of the three methods can be used to constrain the search space in the design of an efficient ANN using successive design iterations.

Normalized Posit Representation
The Posit representation is inherently designed to encode numbers in the range (−∞, ∞). However, due to their tapered accuracy, numbers near ±1 have better accuracy in comparison to extremely small or large numbers [39]. Thus, low-precision Posit numbers perform better than an equivalent linear fixed-point representation during the quantization of normalized ANN weights. While processing normalized numbers, sub-optimal utilization of all possible Posit bit-patterns leads to half of them being unused. This can translate to communication and storage overheads, as more than required bits are being transferred around. Similarly, a higher number of bits, than that required for storing the information, are processed during each computation. Hence, we propose normalized Posit-an alternative representation based on Posits which preserves its encoding efficiency, hardware realization and tapered accuracy while doubling the usable bit patterns within the normalized range. This normalized Posit representation is a logical subset of Posits that is customized for the representation of normalized numbers. For example, rows in the table show the bit-patterns which represent normalized numbers. It is evident that the two leading bits of the Posit representation are identical when the bit pattern denotes a normalized number; we leverage this finding to drop the leading Posit bit in our proposed normalized Posit representation. This Posit representation helps us encode N -bit Posit functionality within the normalized range with N − 1 bits. This leads to a reduction in storage requirement while still being able to reuse existing Posit arithmetic hardware by replicating the leading bit near the processing unit. However, existing hardware implementations are not optimized to perform normalized Posit-only arithmetic. Existing implementations do not take complete advantage of the benefits arising as a consequence of the potentially unidirectional nature of bit shifts required to extract normalized Posits. To this end, we propose a novel parameterized Posit-to-FxP converter, PoFx, that implements an optimized extraction for normalized Posit numbers.
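As a minimal illustration of this idea, the following Python sketch shows the bit-level behavior of dropping and replicating the leading bit (function names are ours; this is not the hardware design):

```python
def to_normalized_posit(posit_bits: int, n: int) -> int:
    """Drop the leading bit of an n-bit Posit pattern whose two leading bits
    are identical (i.e., a pattern encoding a normalized number)."""
    msb = (posit_bits >> (n - 1)) & 1
    assert msb == (posit_bits >> (n - 2)) & 1, "not a normalized-number pattern"
    return posit_bits & ((1 << (n - 1)) - 1)      # keep the lower n-1 bits

def from_normalized_posit(norm_bits: int, n: int) -> int:
    """Rebuild the n-bit Posit pattern by replicating the stored leading bit."""
    msb = (norm_bits >> (n - 2)) & 1
    return (msb << (n - 1)) | norm_bits

# Posit(8, ES=0): +0.5 is 0b00100000 and -0.5 is 0b11100000; both round-trip.
for p in (0b00100000, 0b11100000):
    assert from_normalized_posit(to_normalized_posit(p, 8), 8) == p
```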

PoFx: Normalized Posit to Fixed-Point Converter
Most Posit-based computations require a decode stage to extract the encoded value before arithmetic operations. Currently, Posit-based arithmetic relies heavily on the extraction of Posit numbers to a floating-point-like representation before operating on them, which leads to increased resource utilization.

In TABLE 2, we illustrate how minimal resources are used during the conversion to an FxP format after Posit field extraction by working at the bit level. The key to developing this algorithm rests on recognizing that the fraction field extracted from the Posit representation is identical to that required in the FxP output. Thus, once the data in the Posit bit pattern is extracted into its components s, k, e and f, obtaining the Posit value only requires us to set a bit, store the extracted fraction bits to its right, and apply a final bit shift determined by $2^{ES} \cdot k + e$. This shift amount can be computed by appending ES zero bits to k (i.e., shifting k left by ES positions) and adding e, as illustrated in Fig. 6. The sign bit, together with the shifted bit sequence, gives us the Posit value in sign-magnitude FxP format. The PoFx conversion algorithm, which converts a Posit(N, ES) representation to a fixed-point representation FxP(M, F) through bit-level manipulation, is summarized in Algorithm 1. Stage A (comprising Stages A1, A2 and A3) stores the sign bit and prepares the Posit bits for subsequent extraction. Stage B1 implements an optimized algorithm to evaluate the number of contiguous 1's. Stage B2 performs bit manipulations to ascertain the locations of the exponent and fraction bits and subsequently extracts them. All the loop indices are carefully chosen based on the constraints arising from the Posit representation. Stage C performs the bit-shift calculation and Stage D implements the bit shifts. The final Stage E is optional, depending on the application, and involves the conversion from sign-magnitude format to two's complement.
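A behavioral Python sketch of this conversion is shown below. It mirrors the extraction of s, k, e and f followed by the shift by $2^{ES} \cdot k + e$, but it is only illustrative: the stage mapping in the comments is approximate, special cases such as NaR are ignored, and rounding is by truncation rather than by the Posit standard's rounding rules.

```python
def posit_to_fixed(bits: int, n: int, es: int, frac_bits: int):
    """Decode an n-bit Posit pattern (unsigned int) into a sign-magnitude
    fixed-point value with `frac_bits` fractional bits (truncating)."""
    if bits == 0:
        return 0, 0                                   # zero
    sign = (bits >> (n - 1)) & 1
    if sign:                                          # Stage A: take the two's complement for negatives
        bits = (-bits) & ((1 << n) - 1)
    body = bits & ((1 << (n - 1)) - 1)                # drop the sign bit
    # Stage B1: length of the run of identical regime bits
    r = (body >> (n - 2)) & 1
    m = 1
    while m < n - 1 and ((body >> (n - 2 - m)) & 1) == r:
        m += 1
    k = (m - 1) if r else -m
    # Stage B2: extract exponent and fraction from the remaining bits
    rem = max(n - 2 - m, 0)                           # bits left after the regime terminator
    rem_bits = body & ((1 << rem) - 1)
    e_len = min(es, rem)
    e = (rem_bits >> (rem - e_len)) << (es - e_len)   # missing exponent bits are zeros
    f_len = rem - e_len
    f = rem_bits & ((1 << f_len) - 1)
    # Stages C/D: scale by 2^(2^es * k + e) and align to frac_bits
    scale = (1 << es) * k + e
    shift = frac_bits + scale - f_len
    significand = (1 << f_len) | f                    # the value 1.f as an integer
    mag = significand << shift if shift >= 0 else significand >> -shift
    return sign, mag                                  # Stage E (two's complement) omitted
```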
The proposed PoFx can be adapted to perform normalized PoFx conversion, which leads to lower resource utilization and improved performance in ANNs. This is primarily due to the drastic simplification of Stage C and Stage D: in this case the shifts are unidirectional (towards the right), always making the value smaller. For normalized Posits, we set F = M − 1, since all bits other than the sign bit can be used for the fraction. The dropped leading bit is replicated within Stage A, followed by a simplified extraction in Stage B1, since the regime always begins with a zero and K therefore stores only a magnitude. We use an optimized algorithm, illustrated in Fig. 6, to evaluate the modified shift $2^{ES} \cdot K - E$ in Stage C. We store a 1 after the assumed decimal point in the normalized PoFx extraction and thus always need to right-shift one position less; this is achieved implicitly by adding the one's complement of E to $2^{ES} \cdot K$. The five stages in our proposed design can be pipelined to further improve the throughput of the PoFx converter, as there are no feedback paths between the stages, thus eliminating data hazards. We note that although the normalized Posit representation can represent the value −1, the normalized PoFx cannot extract it due to its implicit storage in sign-magnitude format. For the rest of the article, the term PoFx will be used to denote the normalized PoFx. Similarly, Posit(N, ES) and Posit(N − 1, ES) will be used to denote Posit and normalized Posit, respectively.

MAC Unit with PoFx Converter
The PoFx converter can be used for any application that can benefit from storing a large number of parameters efficiently. As a special case for ANNs, we integrate the normalized PoFx into MAC units to facilitate the use of our proposed optimizations for improving low-precision ANN inference. Fig. 7 shows the schematic of a parameterized PoFx-converter-based MAC along with the ReLU activation function. As shown in the figure, the weights/biases are assumed to be stored/communicated as Posit(N − 1, ES) numbers. These values are then converted to their corresponding M-bit FxP representations and multiplied with the M-bit input activation values. To accommodate the overflows resulting from the accumulation of a large number of 2M-bit values, we propose to use a 3M-bit adder. After accumulating all the values for a single node in a layer of an ANN, we pass the 3M-bit result to the activation function.
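Reusing the two sketches above, a behavioral Python model of this PoFx-based MAC node could look as follows. It is illustrative only: the bit-width saturation and rounding details of the actual hardware are omitted, and the function name is ours.

```python
def pofx_mac(stored_weights, activations, n: int, es: int, m: int) -> int:
    """Behavioral model of a PoFx-based MAC node followed by ReLU.
    `stored_weights` are (n-1)-bit normalized Posit patterns; `activations`
    are signed M-bit fixed-point integers with M-1 fractional bits."""
    acc = 0                                            # a 3M-bit accumulator in hardware
    for w_norm, x in zip(stored_weights, activations):
        w_full = from_normalized_posit(w_norm, n)      # replicate the dropped leading bit
        sign, mag = posit_to_fixed(w_full, n, es, frac_bits=m - 1)
        acc += (-mag if sign else mag) * x             # 2M-bit product accumulated
    return max(acc, 0)                                 # ReLU activation
```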
It can be noted that the PoFx-based MAC unit allows the designer to represent the weights/biases with a smaller number of bits while still being able to implement different kinds of FxP-based arithmetic optimizations, such as precision scaling, approximations, etc. However, the effect of such a reduced-bit representation on the ANN's behavior, and the corresponding reduction in the compute and communication/storage overheads of the associated accelerators for each layer, need to be estimated. The next two subsections provide the details of our contributions regarding these aspects of designing a PoFx-based ANN.

Behavioral Analysis
To evaluate the impact of various quantization schemes on the output accuracy of a DNN, we have utilized TensorFlow to implement a high-level behavioral framework, as shown in Fig. 8. However, our proposed framework is generic and allows the integration of other types of quantization schemes. Our proposed workflow performs a thorough analysis of the inter-conversions of these schemes to evaluate the impact of the available quantization step sizes and the dynamic ranges offered by each scheme. For example, the FxP-based representation of an FP32-based parameter can be obtained as shown by paths 1, 3 and 5 in the figure. As shown by the classification accuracy results in Section 5, the utilization of each of these schemes has a distinct impact on the final output accuracy. Given the description of an ANN and the various quantization schemes, the proposed framework provides quantization configurations fulfilling the desired accuracy constraints. These selected configurations are then used by our proposed Accelerator Design tool flow to compute their respective performance metrics.

Accelerator Design
The HLS-based design flow, shown in Fig. 9, is used for evaluating the associated trade-offs between the computation overhead and the communication/storage gains offered by the PoFx-based MAC units. The design-choices tree originating from the HLS directives shows the various degrees of freedom (not exhaustive) associated with the design of an accelerator for a fully-connected layer. We assume a weight-stationary [40] design, where a set of weights for a subset of the artificial neurons in the layer is transferred once to the hardware accelerator. Subsequently, each input activation vector is transferred and the corresponding output activation of each neuron is computed. Therefore, the computation of each output activation vector can be seen as the multiplication of a matrix (weights) by a vector (input activations).
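As a reference for this computation pattern, the following simplified floating-point behavioral model in Python (not the HLS C++ design) expresses the weight-stationary layer as one matrix-vector product per streamed activation vector; the matrix and stream sizes match those used later in the accelerator evaluation.

```python
import numpy as np

def weight_stationary_fc(weights: np.ndarray, activation_stream: np.ndarray) -> np.ndarray:
    """Weights (in_features x out_features) are loaded once; each input
    activation vector is then streamed in and one output vector is produced."""
    outputs = []
    for x in activation_stream:             # stream of input activation vectors
        y = x @ weights                      # matrix (weights) times vector (activations)
        outputs.append(np.maximum(y, 0))     # ReLU at the output of each neuron
    return np.stack(outputs)

# Example: 64x10 weight matrix, 1000 input activation vectors of size 1x64
W = np.random.randn(64, 10).astype(np.float32)
X = np.random.randn(1000, 64).astype(np.float32)
Y = weight_stationary_fc(W, X)               # shape: (1000, 10)
```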

If the weights are both moved and stored as Posit(N − 1, ES), each MAC operation requires a run-time PoFx conversion (Fig. 7). However, if the weights are moved as Posit(N − 1, ES) and stored as FxP (using PoFx), the MAC units do not require the run-time conversion during each computation. This approach, though, increases the storage requirements compared to storing the weights as Posit(N − 1, ES). It must be noted that the joint exploration across HLS directives and quantization schemes is necessary for a good estimation of the accelerator characteristics: performance improvement using HLS directives usually involves replicating compute and memory resources, which are in turn dependent upon the choices related to the quantization schemes.

Experiment Setup
The proposed PoFx converter and the associated computer arithmetic blocks were implemented using Verilog HDL. Python-based scripts were used for automating the generation of the parameterized designs. SmallPosit HDL [41] was used for generating the Posit-based arithmetic designs. The hardware designs were characterized using Xilinx Vivado Design Suite. For the calculation of the dynamic power of all implementations, Vivado Simulator and Power Analyzer tools have been utilized. All designs have been implemented on Xilinx Zynq UltraScale+ MPSoC (xczu3eg-sbva484-1e device). The behavioral analysis was achieved using Python-based implementations and used TensorFlow [42] for estimation of various quantization induced error metrics. Xilinx Vivado HLS 18.3 was used as the High-level Synthesis tool for accelerator design. While the results for the behavioral analysis correspond to the experiments using VGG16 as the test application, all the proposed methods can be used for any arbitrary application.

Normalized PoFx
We analyze the impact of varying the output bit-width (M) of the PoFx converter on the overall performance of the PoFx for a given Posit configuration, including its dynamic power consumption across various Posit configurations.

MAC Design Analysis
The proposed PoFx allows the utilization of resource-efficient and high-performance FxP-based arithmetic units for Posit number systems. To evaluate the efficacy of the proposed approach and estimate the associated overheads of the PoFx, we compare PoFx-based 8-bit MAC units with a traditional FxP-based MAC unit (as shown in Fig. 7, an M-bit FxP-based MAC includes an M × M multiplier and a 3M-bit adder). The results of these comparisons for various configurations of Posit are presented in Fig. 12. The critical path delay and resource utilization of the MAC follow a gradually rising trend with both N and ES values. It can be noted that in a few cases, especially for ES = 0, the PoFx-based MAC provides better performance across critical path delay, power dissipation, and LUT utilization than the FxP-only MAC. For ES = 0, the Posit scheme's dynamic range is limited, and the PoFx does not utilize the complete dynamic range of the FxP. The limited number of unique FxP values after conversion allows the synthesis tool to optimize the overall design of the PoFx-based MAC and improve the associated performance metrics. The power metrics do not follow a well-defined trend, as they depend on the bit switches required to obtain the correct output bit sequence. Compared to the FxP-only MAC, we report worst-case overheads of 22.8%, 5.0% and 15.5% for critical path delay, power dissipation, and LUT utilization, respectively. Similar trends are observed in Fig. 13, which compares the same performance metrics for a 16-bit FxP MAC. To further evaluate the efficacy of the PoFx-based MAC design, we compare it with the FxP-only MAC, a Posit-only MAC, and a Posit-based 3-input Fused Multiply Add (FMA) [41]. Moreover, for a thorough exploration of the PoFx-, FxP-, and Posit-based designs, we have synthesized two types of designs: one that allows the synthesis tool to optimize across the constituent blocks (converters, multipliers, and adders) and one that optimizes the constituent blocks separately. Fig. 14 and Fig. 15 show the comparison of the power-delay product (PDP) and the LUT utilization of these designs for 8- and 16-bit designs, respectively. The Posit-only MAC, implemented using a standalone N-bit Posit adder and an N-bit Posit multiplier, has significantly higher PDP and LUT utilization as a result of the extraction and packaging of Posits between stages. The Posit-based FMA, though optimized, requires more hardware resources for implementation. It can be observed that the PoFx-based MAC designs fall closely within the range of the FxP-only MAC. Further, the Posit-only MAC and Posit-based FMA designs generate an N-bit output, whereas the proposed design generates a more precise 3N-bit output once extracted. This can lead to lower inter-layer losses in ANNs, as we can choose the rounding mechanism at the output based on the network to retain as much precision as possible before transferring the value to the next stage.

Behavioral Analysis
We have considered DNNs as a test case to show the impact of various number representation schemes on the output accuracy of high-level applications. For this work, we have used a pre-trained VGG16 [4] network for the classification of the ImageNet dataset [43]. The VGG16 network mainly consists of 13 convolution layers and 3 fully connected layers. The very large number of the network's trained parameters, 138 million, makes it a sound candidate for evaluating the efficiency of various quantization schemes. The single-precision FP32-based Top-1 and Top-5 classification accuracies over the 50,000 validation images in the ImageNet dataset are 69.72% and 89.09%, respectively. Our proposed TensorFlow-based framework performs a multi-level analysis to identify possible quantization configurations fulfilling the output accuracy requirements of the network.

Weights Quantization Error Analysis
In the first step, our framework quantizes the parameters (weights and biases) of all layers and filters out the configurations having large quantization-induced errors (see, for example, Fig. 16).

In our current work, we focus only on the quantization of weights and biases. The use of a specific quantized representation of the weights and biases requires a compatible MAC design for inference. Hence, we performed a joint analysis of the performance of the various MAC designs and the errors induced in the parameters by the corresponding quantization scheme. The various MAC designs are grouped under three categories: PoFx-based, Posit-based (which includes both multiplier-adder combinations and FMA-based designs) and FxP-based. For the PoFx-based and Posit-based designs, lower bit-width input designs were also considered. For example, for 8-bit quantization, N was varied from 5 to 8. Similarly, for 16-bit quantization, N was varied from 5 to 16.

TABLE 3 shows the Pareto analysis results for 8- and 16-bit MACs with three objectives: PDP, average quantization-induced error and LUT utilization. We report the number of dominating points for each of the three types of quantization schemes used for the parameters of each layer of VGG16. As shown in the table, the PoFx-based designs contribute significantly to the number of points on the Pareto front for 8-bit precision. We also report the percentage increase in the Pareto-front hypervolume due to the usage of PoFx-based designs over the collection of only Posit- and FxP-based designs. As seen in the table, using PoFx-based designs we report up to a 173% increase in the hypervolume for 8-bit precision. Fig. 17 shows the dominating and dominated points for each of the three categories in the corresponding design space of 8-bit precision MACs for the first layer (Conv1_1) of VGG16. It can be observed that the Posit- and FxP-based designs contribute one point each to the resulting Pareto front, compared to 9 PoFx-based points. The improvements for 16-bit precision are lower compared to 8 bits. However, as shown in TABLE 4, if we also consider the bit-width of the parameters as a design objective in the analysis, we report consistent improvements using PoFx-based designs for both 8- and 16-bit precision.

TABLE 3: Number of Pareto-dominating points per quantization category (objectives: PDP, average quantization-induced error, LUT utilization) and the percentage increase in Pareto-front hypervolume due to the PoFx-based designs, for 8-/16-bit precision.

Layer      PoFx (8/16)   Posit (8/16)   FxP (8/16)   Hypervolume increase % (8/16)
conv1_1     9 / 7          1 / 5          1 / 0        173 / 74
conv1_2     8 / 4          3 / 11         1 / 0        125 / 2
conv2_1     8 / 4          3 / 9          1 / 0        121 / 2
conv2_2     8 / 3          2 / 8          1 / 0        119 / 1
conv3_1     8 / 3          2 / 8          1 / 0        115 / 1
conv3_2     8 / 1          4 / 10         1 / 0        109 / 0
conv3_3     8 / 1          4 / 10         1 / 0        109 / 0
conv4_1     8 / 1          6 / 8          1 / 0        104 / 0
conv4_2     7 / 1          6 / 8          1 / 0         97 / 0
conv4_3     7 / 1          6 / 8          1 / 0         98 / 0
conv5_1     7 / 1          6 / 8          1 / 0        102 / 0
conv5_2     8 / 1          6 / 8          1 / 0        102 / 0
conv5_3     7 / 1          6 / 8          1 / 0        102 / 0
fc6         7 / 0          5 / 8          1 / 0         67 / 0
fc7         7 / 0          5 / 8          1 / 0         86 / 0
fc8         8 / 1          6 / 8          1 / 0        104 / 0

TABLE 4: Same analysis as TABLE 3 with the parameter bit-width added as a fourth design objective.

Layer      PoFx (8/16)   Posit (8/16)   FxP (8/16)   Hypervolume increase % (8/16)
conv1_1    21 / 31        10 / 41        0 / 0         40 / 74
conv1_2    20 / 32        11 / 23        0 / 0         27 / 50
conv2_1    20 / 31        10 / 23        0 / 0         27 / 48
conv2_2    17 / 30        11 / 18        0 / 0         27 / 48
conv3_1    17 / 29        10 / 17        0 / 0         27 / 47
conv3_2    17 / 26        12 / 18        0 / 0         27 / 46
conv3_3    17 / 26        12 / 18        0 / 0         26 / 46
conv4_1    17 / 26        10 / 16        0 / 0         27 / 46
conv4_2    17 / 26        10 / 16        0 / 0         27 / 46
conv4_3    17 / 26        10 / 16        0 / 0         27 / 46
conv5_1    17 / 26        10 / 16        0 / 0         27 / 46
conv5_2    17 / 26        10 / 16        0 / 0         27 / 46
conv5_3    17 / 26        10 / 16        0 / 0         27 / 46
fc6        17 / 25        11 / 15        0 / 0         28 / 46
fc7        17 / 26        11 / 15        0 / 0         27 / 46
fc8        17 / 26        10 / 16        0 / 0         27 / 46
Since the number of input bits is an indicator of the communication power dissipation (and energy consumption) for moving the weights, using PoFx-based quantization can reduce the overall power dissipation during DNN inference. In the second step of the behavioral analysis, our framework utilizes the quantized parameters to evaluate each configuration's impact on the output activations of each layer. The computation of the output activations involves using a MAC design that is compatible with the chosen quantization scheme. Similar to the analysis presented in Fig. 17 for the errors induced in the parameters, Fig. 18 shows the design space while considering the errors in the output activations of the first layer (Conv1_1) of VGG16. The 3D scatter plot shows the various design points corresponding to the three categories of MAC designs: PoFx-, Posit- and FxP-based. It can be observed from Fig. 18 that the contribution of the PoFx- and FxP-based designs to the Pareto front is mainly due to better hardware performance, i.e., lower PDP and a reduced number of utilized LUTs. Similarly, the Posit-based designs' contribution is mainly due to lower average error, albeit at high hardware costs. The resulting Pareto front in Fig. 18 has 7, 13 and 1 points from the PoFx-, Posit- and FxP-based designs, respectively, with a 12.4% improvement in the hypervolume over the collection of only Posit- and FxP-based designs. It must be noted that, since we focus on the quantization schemes for only the parameters, the input activations for each of the layers are kept at FP32 precision during the behavioral analysis. After computing the output activations, they are quantized using the configuration employed to quantize the respective parameters. This lets us evaluate the impact of the proposed methods and designs while other aspects are kept unchanged.
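The Pareto filtering behind TABLE 3, TABLE 4 and Figs. 17-18 can be reproduced with a simple dominance check; the following minimal Python sketch (with hypothetical design points; all three objectives are minimized) illustrates it:

```python
from typing import List, Tuple

Point = Tuple[float, float, float]          # (PDP, average error, LUT utilization)

def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated (Pareto-optimal) design points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical MAC design points: (PDP, avg. quantization error, LUTs)
designs = [(1.00, 0.052, 120), (0.85, 0.090, 100), (1.20, 0.030, 150), (1.05, 0.060, 125)]
print(pareto_front(designs))                # the last point is dominated by the first
```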

Classification Error Analysis
Finally, the behavioral analysis involves estimating the impact of the proposed methods on the classification accuracy (TABLE 5). Similar correlations were also observed in the case of PoFx-based designs: designs with higher PDP usually result in better accuracy. Compared to FxP-8-based designs, PoFx(N − 1 = 7, ES = 1) achieves similar accuracy with lower PDP (≈ 5%) and a slightly higher LUT overhead (≈ 15%). Similarly, PoFx(N − 1 = 6, ES = 2) achieves comparable accuracy with an even lower PDP (≈ 18%) and a lower LUT overhead (≈ 8%). Additionally, these PoFx-based designs require fewer bits for representing the parameters.

Accelerator Design Analysis
In order to estimate the system-level impact of using the proposed PoFx methodology, we integrated the candidate solutions in the design of an accelerator for a fully connected layer of a DNN. The accelerator was designed using C++ and synthesized using Xilinx's Vivado HLS. To keep the design generic, we implemented a matrix-vector multiplication. The matrix, representing the weights of the fully connected layer, was of size 64 × 10. Each vector, representing an input activation, is of size 1 × 64. One thousand input activations were used to estimate the switching activity in order to compute the power dissipation. The implemented accelerator uses the ReLU activation function.

Accelerator DSE
In Section 4.3, we presented some of the degrees of freedom in the design of an accelerator that can result in varying performance metrics and resource requirements. Fig. 19 shows the resulting metrics from seven different implementations of the accelerator under test using 8-bit FxP parameters and activations. The base implementation refers to the basic design without using any HLS directives. As shown in TABLE 7, the other implementations vary in terms of the optimization goals, and the type and partitioning of the memory used for storing weights and activations. As seen in Fig. 19, the optimizations of fullOpt result in the lowest latency implementations. However, it also results in the highest power dissipation and maximum resource utilization. Also it can be noted from the figure that using BRAMs results in lower total power dissipation than using LUTRAMs for both fullOpt and dotOpt implementations.
Further, it can be observed that the CPD varies with the optimization modes, even with the usage of the same arithmetic hardware, as the implementation of multiple parallel operators spreads the designs spatially and the routing delays tend to increase accordingly. It must be noted that Fig. 19 shows only a subset of the possible implementations using the various degrees of freedom. While it is possible to generate many more design points using various types of array partitioning and loop-related HLS directives, we limit our evaluation to these seven types of implementations for showing the impact of our proposed designs. It can be observed that the Posit-based design has higher CPD, power dissipation and LUT utilization for almost every implementation. For instance, we report ≈ 80% reduction in the CPD with the PoFx(Move & Store)-based design for the fullOpt implementation. The latency metric is similar in case of each implementation type across the four design types. It can be observed that the PoFx(Move & Store) has higher CPD than PoFx(Move) for base implementation, but lower CPD for dotOpt and fullOpt implementations.

Accelerator Resource Requirements
In the base implementation of PoFx(Move & Store), the additional overhead of the PoFx conversion during computation increases the CPD. However, in the dotOpt and fullOpt implementations, the delay of the additional interfaces to the partitioned, higher bit-width FxP weights array dominates the low-cost PoFx converter delay. Since, in the absence of any memory-specific HLS directives, the LUTs are used to store the weights, the LUT utilization of the PoFx(Move & Store) design is the lowest for each implementation type. We report ≈ 60% reduction in LUT utilization with the PoFx(Move & Store)-based design over the Posit-based design for the fullOpt implementation. Further, BRAMs are not instantiated in the dotOpt and fullOpt implementations in any of the designs. However, as shown in Fig. 19, BRAMs present a more power-efficient alternative.
We explored the impact of using memory-related HLS directives in the four designs for the fullOpt and dotOpt implementations. Fig. 21 shows the accelerators' relative resource requirements for the fullOpt_BRAM implementation of the four designs with varying configurations of Posit(N, ES) and PoFx(N − 1, ES). It can be observed that the LUT utilization of Posit is much higher in all cases. This can be attributed to the high hardware costs of the Posit arithmetic blocks. Similarly, the RegFF utilization of the PoFx(Move & Store) design is lower than that of the PoFx(Move) design in most cases. However, the BRAM utilization remains constant for all the designs across all configurations. This is due to the granularity of the BRAM memory: even if the weights are stored with fewer than 8 bits, an equal number of BRAM instances is used. However, as shown in Fig. 22, if LUTRAMs are used for storing the weights and activations (the fullOpt_LRAM implementation), lower LUTRAM utilization is observed in PoFx(Move & Store) than in PoFx(Move). For instance, compared to Posit(N = 7, ES = 0), we report ≈ 46% reduction in LUTRAM utilization with the PoFx(N − 1 = 6, ES = 0) design. This difference is reduced to zero for PoFx(N − 1 = 7, ES = 0). Similarly, the difference in RegFF utilization between the PoFx(Move) and PoFx(Move & Store) designs reduces with increasing values of N − 1. Therefore, the proposed PoFx representation results in a reduction in the accelerator's overall resource consumption.

CONCLUSION
To implement machine learning applications on resource- and energy-constrained embedded systems with limited computational power, it is imperative to consider the unique features of various optimization techniques together. This paper proposes the ExPAN(N)D framework for analyzing and combining the number representation efficacy of the Posit scheme and the resource and compute efficiency of FxP-based schemes. ExPAN(N)D utilizes a modified, novel representation of the Posit number system to represent the trained parameters of DNNs. Using the proposed scheme, we use N − 1 bits for an N-bit Posit configuration to reduce the storage requirements. For performing arithmetic operations on trained parameters stored in Posit format, ExPAN(N)D proposes and utilizes a resource-efficient Posit-to-FxP converter, PoFx. Using PoFx, all arithmetic operations are performed with FxP-based arithmetic operators.
Compared to a 6-bit Posit-based implementation, our proposed 8-bit PoFx-based ANN accelerator provides up to 80% and 60% reductions in critical path delay and overall resource utilization, respectively, for the highest-throughput design. Further, compared to an 8-bit FxP-based implementation, the 8-bit PoFx-based accelerator provides up to a 46% reduction in the storage requirements of the trained parameters. ExPAN(N)D utilizes a TensorFlow-based behavioral framework to evaluate the impact of different quantization configurations on the final output accuracy of ANNs. We intend to extend the proposed framework by incorporating other network optimization techniques, such as approximate arithmetic operators and various other quantization schemes.