High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

Precision-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of SoA ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers, which we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as a case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, reduced energy consumption in most cases, and a marginal area overhead of 0.9%, 2.5%, and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. In summary, our work provides a comprehensive understanding of ST-based accelerators' performance in an SoC context, paving the way for future enhancements and for addressing the identified inefficiencies.


I. INTRODUCTION
In the context of Deep Learning (DL) at the edge, quantization is an established method for reducing memory footprint and bandwidth, saving energy, and performing faster inference when dealing with Deep Neural Networks (DNNs) on resource-limited devices [1]. Recently, there has been a growing interest in academia and industry towards Mixed-Precision Quantization (MPQ) [2]. This technique leverages the different sensitivity to quantization of each DNN layer [3], [4] to search for the optimal number of activation and weight bits for each individual layer, enabling accuracy vs latency and accuracy vs energy trade-offs [5].
To take advantage of MPQ, several Precision-Scalable (PS) multipliers, Multiply-and-Accumulate (MAC) units, and DNN accelerators targeting Machine-Learning (ML) workloads have recently been proposed [6], [7], [8], [9], [10], [11], [12]. In this paper, we focus on a particular family of PS multipliers called Sum-Together (ST). These are special reconfigurable subword-parallel multipliers that, depending on the selected configuration, not only perform N multiplications in parallel among multiple low-precision operands, but also sum together the results of these N multiplications within the multiplier itself, i.e., without requiring external addition. At full precision, they perform N = 1 standard multiplication (e.g., on 16 bits), whereas at reduced precision they compute N = 2 or 4 parallel dot-products using the low-precision operands packed in the multiplier's inputs (e.g., operands on 8 or 4 bits). In other words, the bitwidth of the operands is inversely proportional to N (e.g., 16/N bits). ST multipliers are well-suited for integration within MAC units [6], [9]. They enable parallel multiplications of low-precision quantized data in a Single Instruction Multiple Data (SIMD) fashion and, at the same time, their sum-together feature further speeds up MAC operations by saving N − 1 MAC additions compared to conventional MAC units. For this reason, ST-based PS MAC units have recently found application in layer-specific DNN hardware accelerators, providing MPQ support and speeding up the overall layer computation by a factor of up to N [8], [10], [11].
The contributions of this paper in this field cover multiple aspects: 1) We enrich our previous performance, power, and area (PPA) comparison of SoA ST multipliers [8] by introducing two novel designs. The first, which we name BW-ADD, is an improved Baugh-Wooley (BW) multiplier with a modified final adder that provides a shorter critical path than the original Ripple Carry Adder (RCA) used in [6]. The second, which we call HLS ST, is an ST multiplier derived from High-Level Synthesis (HLS). Indeed, one of our objectives is to evaluate the capability of HLS to generate a competitive ST multiplier in terms of PPA compared to manually-designed Register-Transfer Level (RTL) implementations. We also add to the PPA comparison our radix-4 Booth ST multiplier, recently presented in [7]. 2) We provide an extended explanation of the architecture of our ST-based PS hardware accelerators for 2D-Convolution (2D-Conv), Depth-wise Convolution (DW-Conv), and Fully-Connected (FC) layers, developed using HLS and previously proposed in [8], [10], and [11], respectively. Specifically, we present a comprehensive overview of our accelerators' design flow, from C/C++ to final hardware implementation. 3) We also add hardware support for uniform integer quantization (UIQ) [1], [13], [14] to quantize the output activations, which was not present in our previous accelerator designs. 4) We perform an extensive design space exploration (DSE) of our ST-based accelerators using HLS. This involves varying many knobs, including parallelism, clock frequency, and especially the type of ST multiplier, which is a novelty of this work with respect to [8]. 5) We illustrate the advantages achieved by our ST-based accelerators in terms of reduced latency and energy consumption, comparing them to accelerators equipped with non-ST fixed-precision 16-bit multipliers, which we call standard accelerators and standard multipliers, respectively. For this assessment, we integrate the accelerators into System-on-Chips (SoCs) under three different scenarios (low-area, low-power, and low-latency) and, as a case study, we execute the models of the MLPerf Tiny benchmark [15], previously quantized in mixed-precision (MP) with a custom version of QKeras [16] for which we release the source code. The results of the PPA comparison of ST multipliers show that architectures having dedicated multipliers for each precision configuration tend to be less area-efficient than those employing a single but reconfigurable multiplier. Moreover, there is not a single winner that satisfies all PPA scenarios, but rather a set of optimal ST multipliers depending on the specific PPA constraints.
The results of the execution of the four MP-quantized MLPerf Tiny networks, using SoCs integrating ST-based accelerators tailored to different PPA scenarios (i.e., low-area, low-power, and low-latency), revealed: an average inference latency speedup, across the four models, of 1.46x, 1.33x, and 1.29x, respectively; a reduced average energy consumption, in most of the cases; and a marginal area overhead compared to SoCs equipped with standard accelerators.
The article is structured as follows. In Sec. II we present the related work, whereas in Sec. III we provide some background on DNN quantization, UIQ, and the four MLPerf Tiny models. In Sec. IV we describe the concept of ST multiplier, outline the architectures of the SoA ST multipliers, and present the newly proposed ones. In Sec. V we detail the working principle and hardware architecture of our ST-based DNN accelerators, whereas in Sec. VI we describe the accelerators' design flow. Finally, in Sec. VII we present the results in three parts: in the first we compare the ST multipliers in terms of PPA; in the second we report the Pareto-optimal accelerators, resulting from the HLS-driven DSE, in the Latency vs Area and Power vs Area spaces; in the last we showcase the achievable latency speedup and energy reduction of the ST-based accelerators, against standard accelerators, when running inference on MP-quantized MLPerf Tiny models.

II. RELATED WORK
Although the definition of ST mode was introduced with the subword-parallel BW ST multiplier of [6], earlier works had already proposed reconfigurable multipliers that support both single high-precision multiplications and parallel low-precision dot-products.
The authors of [17] and [18] introduce SIMD extensions to the Instruction Set Architecture (ISA) of a RISC-V processor featuring a multiplication unit that behaves like an ST multiplier.
In [19], a general-purpose systolic array for DL is proposed.It is made of reconfigurable Fusion Units (FUs) that exploit low-precision multipliers by dynamically merging or keeping their results separate.The architecture of these FUs falls within the divide-and-conquer (D&C) category, as per the taxonomy outlined in [9].To this category also belong the optimized versions in [12] and [20], and their ancestor in [21].
In [22], the authors present a reconfigurable fixed-point multiplier originally designed for digital signal processing (DSP) applications.
In [9], various PS MAC unit (PSMAC) architectures are benchmarked and categorized into subword-parallel, D&C, and bit-serial classes. However, those of [6] and [19] are the only ST-based PSMACs considered there. In [8], instead, we compared in terms of PPA all the main SoA ST multipliers described above.
Recently, we have also contributed to the SoA of ST multipliers with a subword-parallel radix-4 Booth architecture that requires a light-weight reconfiguration logic [7].
Regarding ST-based DNN accelerators, there are a few examples in the literature: in [19] the authors propose a general-purpose systolic array architecture, in [6] the authors describe their implementation of an FC kernel, whereas we derived 2D-Conv [11], DW-Conv [10], and FC [8] layer-specific accelerators using an HLS flow. In particular, in [8] we also carried out a DSE varying several hardware knobs, from clock frequency to HLS directives, to explore a wide range of Pareto-optimal solutions in area, power, or latency. The authors of [9] and [12] have already conducted an exhaustive comparison of various PS hardware accelerators, including [6], [19], [23], [24], [25], and [26]. To the best of our knowledge, no other works have focused on employing HLS techniques for the development of ST-based hardware accelerators, or of PS accelerators in general.
In this work, we enrich the SoA portfolio by introducing two novel ST multipliers (a BW multiplier with a modified final adder and an ST multiplier derived from a functional C/C++ description by an HLS tool) and we perform a comprehensive PPA analysis of all the SoA ST multipliers. We then derive ST-based PS DNN accelerators, as in [8]; however, a distinctive feature of our work is the support for UIQ for the quantization of the final result, which is not mentioned in any of the previously cited accelerators. In this regard, we propose an accelerator design flow that includes minimizing the bitwidths of the fixed-point variables required by the UIQ formulas. We also expand our previous DSE [8] by introducing new hardware knobs. One of these is the selection of the type of ST multiplier inside the accelerators' MAC units, which can be chosen among all the manually-designed RTL descriptions of SoA ST multipliers (as in [8]), and also among the ST multipliers inferred by the HLS tool from a high-level description. Furthermore, we show the latency and energy benefits of ST-based accelerators, against equivalent accelerators based on standard multipliers, when running entire MP-quantized DNNs. Such a comparison, except for our previous works that focused solely on isolated DNN layers [10], [11], has not been extensively examined in the literature.

III. BACKGROUND
A. DEEP NEURAL NETWORKS' QUANTIZATION
The quantization of DNNs is now a common practice that decreases the numerical precision of the weight parameters and activation values of neural network layers. This process reduces the model size, lowering the memory requirements to store weights and activations, as multiple low-precision feature maps and weights can be efficiently packed into the same memory word [14]. For the same reason, it also reduces data transfer costs. Additionally, quantization can improve inference latency, throughput, and energy by taking advantage of high-throughput integer instructions, such as SIMD instructions in microprocessors [18], or specialized hardware operators like subword-parallel ST multipliers [7].
In this paper we focus on UIQ, even though various other quantization techniques exist [1]. This choice is driven by its simple mathematical formulation, its availability in common ML frameworks (e.g., TensorFlow Lite), its efficient mapping on existing hardware (e.g., on 8-bit microcontrollers), and thus its widespread adoption on embedded devices for non-extreme quantization (> 2 bits) [1], [13], [27]. Moreover, when it comes to ASIC implementation, integer/fixed-point math pipelines are more efficient in terms of silicon area and power consumption than floating-point (FP) ones [28], not to mention the faster execution times. In the following, we introduce the UIQ mathematical background in the context of DNNs, borrowing some definitions from [13] and [14]. Notice that, since we target ST-based accelerators only for the inference phase of DNNs, our focus is on UIQ for inference, not for training.

1) UNIFORM INTEGER QUANTIZATION
Given a set of real numbers in the real range [α, β] (e.g., a tensor with a high-precision FP format like FP32), UIQ maps each x ∈ [α, β] to an integer value x_q ∈ [α_q, β_q] represented uniformly on b bits, where [α_q, β_q] is the quantized range: for asymmetric signed integers it is equal to [−2^(b−1), 2^(b−1) − 1], whereas for symmetric signed integers it is [−2^(b−1) + 1, 2^(b−1) − 1]. The process of quantization is defined as:

x_q = clip(round(x/s + z), α_q, β_q)   (1)

where s is the scaling factor, z is the zero-point (i.e., the integer value to which the real value zero is exactly mapped), round is the rounding function (e.g., round-to-nearest), and clip keeps the output within the quantized range by saturating the outliers. In turn, s and z are defined from the chosen real and quantized ranges as:

s = (β − α) / (β_q − α_q)   (2)

z = round((α_q β − α β_q) / (β − α))   (3)

The opposite operation, which brings x_q back to the real range, is defined as:

x̂ = s (x_q − z)   (4)

where x̂ is the closest real value (but not necessarily equal) to the original x, because the rounding and clipping functions may introduce an irrecoverable error.
The quantization mapping discussed so far, with asymmetric ranges and z ≠ 0, is known as affine quantization. Instead, when both ranges are symmetric, z becomes zero and (1) performs only the scale transformation. In this case, the quantization mapping is commonly known as scale or symmetric [29] quantization. Moreover, when s is a unique scalar value for all the channels of a tensor, quantization is referred to as per-layer; instead, when s is a one-dimensional vector of scalars, each corresponding to a different channel of a tensor, quantization is called per-channel.
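To make these definitions concrete, the following self-contained C++ snippet is a behavioral reference for (1)-(4); it is our own illustrative sketch (names such as uiq_params and quantize are not taken from the accelerators' code) and uses FP arithmetic for clarity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Behavioral reference for UIQ, following Eqs. (1)-(4).
struct UIQParams { float s; int32_t z; int32_t aq, bq; };

// Compute the quantized range on b bits, the scaling factor s and the zero-point z.
UIQParams uiq_params(float alpha, float beta, int b, bool symmetric) {
    UIQParams p;
    p.aq = symmetric ? -(1 << (b - 1)) + 1 : -(1 << (b - 1));
    p.bq = (1 << (b - 1)) - 1;
    p.s  = (beta - alpha) / float(p.bq - p.aq);                                   // Eq. (2)
    p.z  = symmetric ? 0
                     : (int32_t)std::lround((p.aq * beta - alpha * p.bq) / (beta - alpha)); // Eq. (3)
    return p;
}

// Eq. (1): quantize a real value to an integer in [aq, bq].
int32_t quantize(float x, const UIQParams& p) {
    int32_t q = (int32_t)std::lround(x / p.s + p.z);
    return std::min(std::max(q, p.aq), p.bq);                                     // clip
}

// Eq. (4): map the integer back to an approximation of the real value.
float dequantize(int32_t xq, const UIQParams& p) { return p.s * (xq - p.z); }

int main() {
    UIQParams p = uiq_params(-1.0f, 1.0f, 8, /*symmetric=*/false);
    int32_t xq = quantize(0.3f, p);
    std::printf("s=%f z=%d xq=%d xhat=%f\n", p.s, (int)p.z, (int)xq, dequantize(xq, p));
}
```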

2) INTEGER-ONLY DNN KERNELS
Now consider the expression of an FC layer:

Y_k = b_k + Σ_{c=1..C} X_c · W_{k,c},   k = 1, …, K   (5)

where Y is the output array (X, W, and b being the input activations, weights, and biases, respectively), and C and K are the number of input and output activations processed by the FC layer, respectively. By applying (4) to each of the four real variables in (5), setting their own quantized ranges a priori, and moving the quantized output array Y_{q,k} to the left-hand side, we obtain the quantized FC expression valid for the k-th output activation:

Y_{q,k} = z_Y + (s_b/s_Y)(b_{q,k} − z_b) + (s_X s_W/s_Y) [ Σ_{c=1..C} X_{q,c} W_{q,k,c} − z_W Σ_{c=1..C} X_{q,c} − z_X Σ_{c=1..C} W_{q,k,c} + C z_X z_W ]   (6)

with the terms, from left to right, denoted as (a) the output zero-point z_Y, (b) the bias contribution, (c) the integer dot product, (d) −z_W Σ_c X_{q,c}, (e) −z_X Σ_c W_{q,k,c}, and (f) C z_X z_W, and where X_q, W_q, b_q, Y_q are the integer values; s_X, s_W, s_b, s_Y are the scaling factors; and z_X, z_W, z_b, z_Y are the zero-points associated with X, W, b, Y, respectively. Term (c) in (6) is the integer dot product, i.e., the core of the computation, whereas term (d) introduces an overhead that causes a performance penalty. Both of them must be computed online because they depend on X_q, which is known only at runtime. On the contrary, terms (a), (b), (e), and (f) are constant, thus they can be computed offline. Notice that in the case of scale quantization for weights and affine quantization for activations, which is a common practice in the literature [13], [14], z_W and z_b become null, and so do terms (d) and (f), while (b) simplifies. This is also our assumption in this work. The result of (6), before being assigned to Y_{q,k}, is also rounded and clipped to fit the desired output quantized range of Y_q (not shown in the formula for better readability).
The mathematical derivations of the integer-only kernels for 2D- and DW-Conv closely follow that of FC. We report them in Appendix A. Hereafter, we will refer to (14), (16), and (6) as the UIQ formulas.
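As an illustration of how the terms of (6) split between offline and online computation, the following C++ sketch (our own; the variable names are hypothetical and z_W = z_b = 0 as assumed above) computes one quantized output activation of an FC layer before the final round-and-clip.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// One output activation of an integer-only FC layer, Eq. (6), assuming
// scale quantization for weights/biases (z_W = 0, z_b = 0).
// Names and structure are illustrative, not the accelerator's code.
int32_t quantized_fc_output(const std::vector<int32_t>& Xq,   // quantized inputs, C elements
                            const std::vector<int32_t>& Wqk,  // quantized weights of row k
                            int32_t bqk,                      // quantized bias of row k
                            float sX, float sW, float sb, float sY,
                            int32_t zX, int32_t zY,
                            int32_t out_min, int32_t out_max) {
    const int C = (int)Xq.size();

    // Offline-computable parts: term (e) depends only on the weights,
    // term (b) only on the bias, term (a) is the output zero-point zY.
    int64_t sum_w = 0;
    for (int c = 0; c < C; ++c) sum_w += Wqk[c];
    const float term_b = (sb / sY) * (float)bqk;

    // Online part: term (c), the integer dot product accumulated by the
    // ST-based MAC array. Term (d) vanishes because z_W = 0.
    int64_t dot = 0;
    for (int c = 0; c < C; ++c) dot += (int64_t)Xq[c] * Wqk[c];

    // Rescale, add the constant terms, round and clip to the output range.
    float y = (sX * sW / sY) * (float)(dot - (int64_t)zX * sum_w) + term_b + (float)zY;
    int32_t yq = (int32_t)std::lround(y);
    return std::min(std::max(yq, out_min), out_max);
}

int main() {
    std::vector<int32_t> Xq = {10, -3, 7}, Wqk = {2, 5, -1};
    std::printf("%d\n", quantized_fc_output(Xq, Wqk, /*bqk=*/4,
                 0.02f, 0.01f, 0.0002f, 0.05f, /*zX=*/1, /*zY=*/-2, -128, 127));
}
```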
Now we focus on the integration of the rectified linear unit (ReLU) into the expressions of the integer-only kernels. In fact, to optimize DNN inference on embedded devices, some adjacent DNN layers can typically be combined into a single one. This operation, called Layer Fusion, is usually performed between convolutional/fully-connected layers and the Batch Normalization (BN) or activation layers (e.g., ReLU), and can be applied to both FP and quantized models. Since our ST-based accelerators support layer fusion with ReLU, as elaborated in Sec. V-B, we explain here the fusion process considering an FC layer with a subsequent ReLU layer. We choose ReLU because it stands out as the most common activation function when it comes to efficient hardware implementations of DNNs. By applying the ReLU non-linearity to the FP output Y_k of (5), we derive the expression of the FP FC-ReLU fused layer:

R_k = max(0, Y_k) = max(0, b_k + Σ_{c=1..C} X_c W_{k,c})   (7)

where R_k is the k-th output of the ReLU layer. By repeating the same steps that led to the derivation of (6) from (5), i.e., applying (4) to each real variable of (7), setting their quantized ranges, and moving the quantized ReLU output R_{q,k} to the left-hand side, we obtain the quantized FC-ReLU fused layer valid for the k-th ReLU element:

R_{q,k} = z_R + (1/s_R) max(0, s_b (b_{q,k} − z_b) + s_X s_W [ Σ_{c=1..C} X_{q,c} W_{q,k,c} − z_W Σ_{c=1..C} X_{q,c} − z_X Σ_{c=1..C} W_{q,k,c} + C z_X z_W ])   (8)

where s_R and z_R are the scaling factor and the zero-point associated with R_{q,k}, whereas all the other variables are the same as those that appear in (6). Notice that R_{q,k} undergoes a round-and-clip operation, not shown for clarity in (8), to fit into the desired quantized range of the ReLU layer.
The expressions for the quantized 2D-Conv-ReLU and DW-Conv-ReLU fused layers can be obtained through the same steps shown here for the quantized FC-ReLU.

B. MLPERF TINY BENCHMARK
The Machine Learning Performance Benchmark (MLPerf) is a widely recognized set of benchmarks in the field
of ML created by the collaborative effort of more than fifty organizations from both academia and industry [15].In particular, the Tiny benchmark is a suite of four lightweight ML models representing real-world applications: Visual Wake Words (VWW), Image Classification (ImgClass), Keyword Spotting (KS), and Anomaly Detection (AD).MLPerf Tiny was designed to assess the performance of edge devices and ultra-low-power tiny ML systems with a limited energy, memory and/or computational power budget (such as mobile phones, microcontrollers, Internet of Things devices), by measuring accuracy, latency and energy during inference on those four ML models.In this respect, MLPerf Tiny is also a competition that encourages innovation in the field of Tiny ML [30].For these reasons, each application not only comes with its own dataset for development and testing, but also with a dedicated performance evaluation dataset (Perf test set).

1) VISUAL WAKE WORDS
The VWW dataset [31] is a collection of 109619 96×96 RGB images, each labeled as containing a person or not, derived from the MSCOCO 2014 dataset [32]. The use-case of this dataset is for a device to wake up when a person is present, covering smart doorbell and occupancy applications. The model to use with this dataset is a smaller version of MobileNetV1 [33] that we name MobileNetV1Tiny.

2) IMAGE CLASSIFICATION
The ImgClass benchmark uses the CIFAR-10 dataset [34], which consists of 60000 32×32 RGB images belonging to 10 classes of 6000 images each. The use-case is for compact vision systems, including manufacturing, Internet-of-Things sensor nodes, and autonomous agents and vehicles. The model to use is a custom ResNetV1 [35] that we name ResNetV1Tiny, which has no pooling layer after the first convolutional layer, fewer residual stacks, and smaller filter dimensions and convolution strides than the original ResNetV1.

3) KEYWORD SPOTTING
The KS benchmark uses a large collection of English words, pronounced by persons with various accents, and derived from the Speech Commands v2 dataset [36].It contains twelve classes: ten with keywords (down, go, left, no, off, on, right, stop, up, yes), one with background noises and one with silence.The use-case is for human-machine interaction, including wakeword detection and remote control of smart devices by voice.The benchmark's target network is the small Depth-wise Separable Convolutional Neural Network (DS-CNN) of [37].

4) ANOMALY DETECTION
This benchmark uses one of the six machine types present in the DCASE2020 competition's dataset [38], the toy-car machine type (ToyADMOS [39]), which contains single-channel 10-second audio samples recorded from seven different toy cars (1000 each) mixed with environmental noise. The use-case is early detection of machine anomalies, a common industrial problem. The model of this benchmark is the reference implementation of DCASE2020, which is an FC-based autoencoder [38] (thus, we name it FC-AutoEncoder). Differently from the other MLPerf Tiny models, the main metric used in AD is not accuracy, but the Area Under the Receiver Operating Characteristics Curve (AUC).

FIGURE 1. Reference ST multiplier, modified from [7]. The 16-bit inputs (A and B) are partitioned in 4-bit chunks to enable multiple operations, as defined by the 3-bit configuration input (CONFIG) and shown in Table 1. The X/• symbol indicates that the multiplier is capable of multiplication and dot-product operations, depending on the configuration.

TABLE 1. Supported precision configurations and operations of the reference ST multiplier of Fig. 1. The last three configurations correspond to dot-product operations at low precision.

IV. SUM-TOGETHER MULTIPLIERS
The new ST multipliers that we introduce, as well as all the others that we analyze in this work, have the I/O signals and behavior of the reference component described in Fig. 1 and Table 1. Depending on the CONFIG configuration signal, this component can perform one 16×16 or 16×8 multiplication, or two 8×8 or 8×4, or four 4×4 dot-products in parallel, using the signed operands packed in the 16-bit inputs A and B. Depending on the configuration, a subset or the entire 32 bits of the multiplier's output P contain the operation result.
We focus on these precisions for the following reasons. In applications that require utmost accuracy, a common choice is to use 16 bits to quantize activations and weights. Some examples are safety-critical applications, such as image segmentation in foggy environments for autonomous driving [5]; others are image processing applications that work with high-resolution satellite images, or high dynamic range (HDR) images and super-resolution [27]. 8 bits is the default precision to quantize DNNs while avoiding performance degradation [27] and is therefore the most commonly used. When smaller bitwidths for inputs and weights are needed, quantization techniques targeting 4 bits already provide an acceptable trade-off between model size reduction and retained performance for most applications [1], [40]. Instead, when dealing with extreme low-bit quantization (< 4 bits), existing methods incur a serious accuracy loss compared to the baseline, unless very extensive tuning and hyperparameter search is performed. Hence, this is still an active line of research [1]. In light of these motivations, we work with ST multipliers that support operands with precision between 16 and 4 bits.
Regarding the asymmetric configurations (i.e., 16×8 and 8×4), we support them because they enable efficient packing of lower-precision operands, such as DNN weights, without compromising the precision of other operands, like DNN activations. Thus, they contribute to reducing the memory footprint of ML models. These configurations are used in SoA ML accelerators and processors [9], [18], and can also be found in commercial ML frameworks such as TFLite Micro. In the following, we first describe the architectures of the SoA ST multipliers as proposed in the literature and emphasize the differences between these and our re-implemented versions. Indeed, since the original SoA ST multipliers support a broad range of bitwidths for inputs and weights, we introduce minor modifications to align their configurations with those of the reference ST multiplier of Table 1. This is important to guarantee fair comparisons in all our experiments of Sec. VII. Lastly, we present the two newly proposed ST multipliers.

A. SOA ST MULTIPLIERS
The original subword-parallel BW multiplier of [6] is composed of a reconfigurable partial product matrix (PPM) and a final RCA [41]. The PPM can be reconfigured to compute one 16×16 multiplication or 16/m dot-products in parallel at m = 8, 4, or 2-bit precision. Our re-implementation of [6] (made with a structural RTL description) is reported on the left side of Fig. 2(a). From top to bottom, it has the same architecture as the original version. We also draw, for clarity, the output concatenation block (&), which merges the least significant output bits coming from the PPM with the most significant ones exiting from the final adder. However, in our version we introduce the following modifications. First, we remove the 2-bit support from the PPM, since we use precisions between 4 and 16 bits, as motivated before. This can be seen from the precision of the main building block of the PPM, which is a 4×4 BW multiplier. Second, in low-precision configurations, we right-shift the final output to the least-significant bit (LSB) position, sign-extending it to 32 bits (right shift & ext block). On the right side of Fig. 2(a), we illustrate how the PPM is reconfigured in the five operating modes of Table 1. In the 16×16 and 16×8 modes, all the PPs of the PPM contribute to the multiplier's output P and the result is represented on 32 bits, making valid (i.e., yellow in Fig. 2(a)) the entire result P. At lower precision, only the yellow PPs on the left-to-right diagonal of the PPM remain active and behave like two 8×8 (8×8 / 8×4 in Fig. 2(a)) or four 4×4 (4×4 in Fig. 2(a)) BW multipliers, respectively. These PPs produce the valid (yellow) output bits of P, which are fewer than 32 in this case and require alignment to the LSB position. The remaining grey PPs are gated, using AND gates, and generate the invalid (grey) output bits.
The multiplication unit of RI5CY [17], a RISC-V processor featuring a SIMD ISA extension, comprises a standard 32-bit integer multiplier, a 32-bit fixed-point multiplier, and two subword-parallel dot-product units. These units accept two 16-bit or four 8-bit operands (packed in one 32-bit register) and accumulate the 32-bit result in one cycle, hence performing simultaneously up to four multiplications and accumulations. The architecture of the two dot-product units consists of either two 17-bit multipliers or four 9-bit multipliers, respectively, followed by a compression tree. In [18], the same authors extend the multiplication unit with two other subword-parallel dot-product units supporting 4- and 2-bit operands, respectively. They also add the ISA support for some asymmetric configurations (8×4, 4×8, and 16×2). In this work, we implement the high-precision fixed-point multiplier and the two low-precision dot-product units of [17] as three mutually exclusive datapaths in a single design, scaling their precision to 16, 8, and 4 bits, respectively. Our re-implementation, illustrated in Fig. 2(b), uses a behavioral RTL description, as the authors of [17] declared that it gives the synthesizer the maximum optimization freedom.
The Fusion Unit of Bit Fusion [19] dynamically composes and decomposes 2-bit multipliers (called BitBricks) through a shift-and-add logic. It supports one 8×8, two 4×8, four 4×4, four 2×8, eight 2×4, or sixteen 2×2 input/weight multiplications in one clock cycle. Several optimizations to the original work of [19] are proposed in [12] and [20], which reduce the complexity and reconfigurability overhead of the shift-and-add logic at the expense of a lower number of supported input/weight precisions (2×2, 4×4, 8×8). However, the ancestor of all these D&C architectures is the reconfigurable and parallel inner-product processor of [21]. This uses larger BitBricks on 4 or 8 bits and a higher input precision. In fact, each of the two input operands can accommodate one 64-bit, four 32-bit, sixteen 16-bit, or sixty-four 8-bit items. It also maintains a fixed bitwidth for the two input operands, ensuring a constant memory bandwidth across different configurations. This contrasts with the D&C architectures in [12], [19], and [20], which suffer from memory bandwidth explosion at reduced precision, as noted in [9]. Among these D&C ST multipliers we implement the one from [21] for a fair comparison with the other SoA ST multipliers on an equal memory bandwidth basis, avoiding the problem of bandwidth explosion. In particular, we re-implement it with 4-bit BitBricks to support 16-, 8-, and 4-bit precision, as shown in Fig. 2(c): four FUs based on four 4-bit BitBricks each, interconnected by shift-and-add logic. We also right-shift its output to the LSB position and sign-extend it to 32 bits in low-precision modes (right shift & ext block as in [6]).

FIGURE 2 (caption excerpt): (d) Multiplier of [22]: four 8-bit Booth multipliers interconnected by muxes ending with an adder tree. (e) Radix-4 Booth ST (as in [7]): reconfiguration logic (blue), 16-bit Booth multiplier (white and gray). (f) HLS ST derived from HLS (proposed in this paper): four multipliers and three adders interconnected by a network of muxes and concatenations.
The reconfigurable fixed-point multiplier of [22] targets DSP applications and consists of four 16-bit Booth multipliers (without final adder), a configurable partial-products compression array, and three configurable 33-bit adders. It supports symmetric (one 32×32, two 16×16, or four 8×8) and asymmetric (two 16×32) signed/unsigned multiplication operations, as well as dot-product/double dot-product operations (one or two 16×16 ± 16×16 with saturation, one or two 16×16 ± 16×16 without saturation, and one 8×8 + 8×8 + 8×8 + 8×8). In our version of [22], we remove the extra logic that is not strictly necessary to implement the reference ST multiplier behavior, such as the saturation logic or the subtraction in the dot-products. Next, we change the way the dot product is computed for all precisions. For example, in configuration 8×8 we swap the lower part with the upper part of operand B: A[15:8]×B[15:8] ± A[7:0]×B[7:0] of [22] becomes A[15:8]×B[7:0] + A[7:0]×B[15:8], as reported in Table 1. We also scale down the maximum and minimum precision to 16 and 4 bits, respectively. The resulting architecture, shown in Fig. 2(d), features four 8-bit Booth multipliers connected by a network of multiplexers ending with an adder tree.
The subword-parallel radix-4 Booth architecture of [7] already supports operands at 16, 8, and 4-bit precision, as the reference ST multiplier in Table 1.As illustrated in Fig. 2(e), it is composed of a lightweight reconfiguration logic (in blue) placed between the two input operands and a standard 16-bit Booth multiplier.The latter, drawn in white and gray, features a Wallace's reduction tree with 4:2 compressors and a Carry Propagate Adder with Prefix Network [41].We implement the architecture of this Booth ST multiplier with a structural description as in our original paper [7].
Finally, within all SoA ST multipliers that do not natively support the asymmetric configurations 16×8 and 8×4 (i.e., [6], [17], [21]), we add a sign-extension logic (not shown in Fig. 2 for better readability) that extends the lower-precision operand B to either 16 or 8 bits before the actual multiplication operation. For this reason, zero-padding of the low-precision operands is not necessary in any configuration, as these operands always fully utilize the entire parallelism of the multipliers' inputs A and B.
As a final note, we implement all of these SoA ST multipliers as signed multipliers.

B. BW-ADD: A BAUGH-WOOLEY ST MULTIPLIER WITH AN IMPROVED FINAL ADDER
In light of the results of our previous works [7], [8], we observe that the BW ST multiplier [6] is particularly area-efficient at clock frequencies lower than 600 MHz. At higher frequencies, the long diagonal critical path of the BW array and the carry chain of the final 16-bit RCA, highlighted by the purple dotted line in Fig. 2(a), are responsible for a significant area degradation [8], since the logic synthesizer must infer large logic gates to meet the stricter timing constraints. Thus, in this paper we address this problem by letting the logic synthesizer select the most suitable final adder implementation that meets the specified timing constraints with the minimum area, rather than forcing it to use an RCA. We name this multiplier BW-ADD. With this change, we expect a lower multiplier area at high frequency, and an unchanged one at low frequency, compared to [6].

C. HLS ST: AN ST MULTIPLIER DERIVED FROM HLS
As we present in Sec. V, we use HLS to generate the RTL of our PS DNN accelerators based on ST multipliers starting from a high-level description. To infer a specific implementation of an ST multiplier in the accelerators' MAC units, we force the HLS tool to import its RTL implementation. Usually, this RTL is described manually, as in the case of the SoA ST multipliers of Sec. IV-A. As an alternative, we decide to describe the ST functionality at a high level and let the HLS tool, which in this work is Siemens Catapult, automatically create its RTL. The source code of this new ST multiplier, which we name HLS ST, is listed in Lst. 1. To easily access bit fields from integer data types, we use the method slc available in the Catapult C++ library ac_int.h (line 1): for example, A.slc<4>(12) is a 4-bit subfield from bit 15 down to bit 12 of the int16 signal A (line 12).
By inspecting the RTL generated by the HLS tool, which corresponds to the schematic in Fig. 2(f), we notice that it contains one 16-bit, two 8-bit and two 4-bit multipliers, three adders with 8/12/16-bit bitwidth precision, and a network of multiplexers and concatenation blocks (&) that unpacks the 16-bit input operands, distributes them to the multipliers and merges their low-precision results into the final 32-bit output.Moreover, the result of low-precision configurations is already aligned to the rightmost LSB position.
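Since Lst. 1 is not reproduced here, the following plain-C++ behavioral model conveys the same functionality that the HLS tool synthesizes from it. It is only a sketch of ours: bit fields are extracted with shifts and masks instead of Catapult's slc method, the 8×8 operand pairing follows Table 1 as quoted above, and the pairings assumed for the 16×8, 8×4, and 4×4 modes are analogous but not taken verbatim from the paper.

```cpp
#include <cstdint>
#include <cstdio>

// Behavioral model of the reference ST multiplier (Fig. 1 / Table 1).
static int32_t sext(uint32_t v, int bits) {            // sign-extend a 'bits'-wide field
    uint32_t m = 1u << (bits - 1);
    return (int32_t)((v ^ m) - m);
}

int32_t st_multiply(uint16_t A, uint16_t B, int config) {
    switch (config) {
    case 0:  // 16x16: one full-precision multiplication
        return (int32_t)(int16_t)A * (int32_t)(int16_t)B;
    case 1:  // 16x8: A is 16-bit, B holds one 8-bit operand (assumed in B[7:0])
        return (int32_t)(int16_t)A * sext(B & 0xFF, 8);
    case 2:  // 8x8: dot product of two 8-bit pairs, A[15:8]*B[7:0] + A[7:0]*B[15:8]
        return sext(A >> 8, 8) * sext(B & 0xFF, 8)
             + sext(A & 0xFF, 8) * sext(B >> 8, 8);
    case 3:  // 8x4: dot product of two 8-bit x 4-bit pairs (assumed pairing)
        return sext(A >> 8, 8) * sext(B & 0xF, 4)
             + sext(A & 0xFF, 8) * sext((B >> 8) & 0xF, 4);
    default: // 4x4: dot product of four 4-bit pairs (assumed reversed pairing)
        return sext(A >> 12, 4)        * sext(B & 0xF, 4)
             + sext((A >> 8) & 0xF, 4) * sext((B >> 4) & 0xF, 4)
             + sext((A >> 4) & 0xF, 4) * sext((B >> 8) & 0xF, 4)
             + sext(A & 0xF, 4)        * sext(B >> 12, 4);
    }
}

int main() {
    // 8x8 mode: pack (3, -2) in A and (5, 7) in B -> 3*7 + (-2)*5 = 11
    uint16_t A = (uint16_t)(((3 & 0xFF) << 8) | ((-2) & 0xFF));
    uint16_t B = (uint16_t)(((5 & 0xFF) << 8) | (7 & 0xFF));
    std::printf("%d\n", st_multiply(A, B, 2));
}
```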

V. ST-BASED HARDWARE ACCELERATORS
A. WORKING PRINCIPLE
We now illustrate the working principle of our three DNN accelerators integrating ST multipliers in their MAC units. Fig. 3 shows the different access patterns (red) that the 2D-Conv, DW-Conv, and FC accelerators use to read data from the activation (blue) and weight (orange) tensors, and how these data are packed into the 16-bit inputs of the ST multipliers.

1) 2D-CONV ACCELERATOR
For every orange filter with C kernels, a MAC unit of the 2D-Conv accelerator performs the multiplication of the C channels of the blue input tensor with the corresponding weight kernels, and the channel-wise accumulation of these multiplications. At full precision (N = 1), the ST multiplier within the MAC unit processes activations and weights from one input channel at a time. Instead, at lower precision the ST multiplier is fed with pairs of activation/weight data from two (N = 2) or four (N = 4) input channels at a time. This process is highlighted in red in the second column of Fig. 3 and allows exploiting the dot-product feature of the ST multiplier, ideally resulting in fewer MAC cycles, which scale as C/N, and lower latency, which scales as 1/N.

2) DW-CONV ACCELERATOR
In DW-Conv, every output channel is the result of the convolution between the corresponding blue input channel and orange weight kernel, with no accumulation along the channel dimension, as it happens instead in the 2D-Conv case.Therefore, we need to use the ST multiplier in a different way than for 2D-Conv: we can accumulate the partial products between the N = 1, 2, or 4 input/weight pairs from the receptive field of the input tensor and the corresponding weight kernel.This new dataflow is reported in red in the third column of Fig. 3.
Compared to 2D-Conv, this accelerator has an overhead that affects the reduction of both MAC cycles and latency, as we show later in Sec. VII-C. This is because the number of accumulations is given by the square of the kernel size (K²), which is not a multiple of N at lower precision (i.e., N = 2 or 4). Let us consider the 3×3 kernel of Fig. 3 as an example. With N = 2 or 4, we need five or three iterations, respectively, to accumulate the products of input activations and kernel weights. In the last iteration, however, only one input pair is within the receptive field of the kernel. As a result, we need to feed the ST multiplier with zeros in place of the missing low-precision operands, but this clearly results in under-utilization of the ST hardware. The number of MAC cycles for DW-Conv is ⌈K²/N⌉ and the latency reduction scales as ⌈K²/N⌉/K², which typically is greater than 1/N, with this overhead decreasing as K increases [10].

3) FC ACCELERATOR
The working principle of this accelerator is shown in the last column of Fig. 3. To compute each element of the green output activation array (e.g., the one highlighted in red), a MAC unit computes the dot product between the blue array of C input activations and one row of the orange weight matrix. The ST multiplier in the MAC unit takes N pairs at a time from the two arrays and either multiplies them in high-precision mode (N = 1) or performs a dot-product in low-precision mode (N = 2 or 4). Similarly to 2D-Conv, C/N subsequent accumulations are needed to complete the calculation. The process is repeated for every row of the weight matrix, until the green output activation array is complete. As a result, the number of MAC cycles and the corresponding latency scale as C/N and 1/N, respectively, like in the 2D-Conv case.
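The MAC-cycle expressions from the three working principles above can be summarized in a few lines of code; the sketch below (ours, for illustration only, with example values for C and K) prints the ideal number of MAC iterations per output element as a function of N.

```cpp
#include <cstdio>

// Ideal MAC iterations per output element (cf. Sec. V-A):
// 2D-Conv and FC scale as C/N, DW-Conv as ceil(K*K/N) due to the kernel-size
// overhead discussed above. C and K below are just example numbers.
int main() {
    const int C = 64;   // input channels (2D-Conv) or input activations (FC)
    const int K = 3;    // DW-Conv kernel size
    for (int N = 1; N <= 4; N *= 2) {
        int conv_fc = C / N;                 // 2D-Conv and FC
        int dw      = (K * K + N - 1) / N;   // DW-Conv: ceil(K^2 / N)
        std::printf("N=%d: 2D-Conv/FC=%d cycles, DW-Conv=%d cycles "
                    "(ideal K^2/N would be %.2f)\n",
                    N, conv_fc, dw, (double)(K * K) / N);
    }
}
```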

B. ACCELERATORS ARCHITECTURE
Our ST-based DNN accelerators share the same general architecture, outlined in the grey rectangle of Fig. 4. It consists of four parts, illustrated later in Secs. V-B1 to V-B4: internal buffers (for input, weight, and output data), memory addressing and concatenating logic, a reconfigurable ST-based PSMAC array, and the quantization and ReLU logic.
We obtain this architecture using the flow on the left side of Fig. 4, starting from a high-level C/C++ description of the ST-based accelerator (C/C++ (top) block) and using HLS techniques to generate the final RTL implementation.We provide a full description of this flow in Sec.VI-C.
Even though it is not the focus of our paper, we assume that the three accelerators share an on-chip global buffer (not shown in Fig. 4), as shown for example in [42] and other more recent papers on ML SoCs [43], [44].In particular, we assume that the global buffer is large enough to store at least two tiles (two is for double buffering) of each of the three relevant tensors involved in the execution of a single accelerator: input activation, weight, and output activation tensor.Indeed, we assume that the complete tensors have been fragmented in tiles [45] to exploit data locality in this on-chip global buffer.Moreover, we assume that an on-chip embedded processor invokes each accelerator to process those tiles one at a time.
Table 2 shows the maximum tile dimensions (Part I) and the maximum tile sizes (Part II) that would be stored in the global buffer. We determined these dimensions through a statistical analysis of the layer shapes of the most common DNNs for edge devices [10], [11] that are available in public Model Zoos for computer vision applications, such as TensorFlow [46], [47], Intel [48], [49], Xilinx [50], and Nvidia [51], [52]. These networks include the well-known families of ResNet, MobileNet, and EfficientNet. Based on our survey, we select 18 and 22 as the input tile height/width (IH/IW) and output tile height/width (OH/OW) dimensions for 2D- and DW-Conv, respectively, because these values represent a reasonable trade-off between the area of the global buffer and the number of iterations over the tiles required by the accelerators to complete the DNN layers [10], [11]. The height and width of weight tiles (KH/KW) are instead 7 and 5 for 2D- and DW-Conv, respectively, to ensure that the accelerators support the majority of DNNs (e.g., ResNetV1 uses 7×7 kernels).
Regarding the input tile channels/activations (IC/IA) and output tile channels/activations (OC/OA), we vary their size during the DSE of ST-based accelerators, as discussed in Sec. VII-B. The values explored are in the first two rows of Table 3, which also contains the HLS directives and implementation constraints that we let vary during the DSE. We call these variables hardware configuration knobs because they affect how the RTL is synthesized by the HLS tool. We use 32 as the maximum value for IC and OC because we found that the numbers of input and output channels of the activation and weight tensors of common DNNs are often divisible by this value. For the FC accelerator, we select values of IA and OA starting from those used in [6], which were 256 and 8, respectively. Then, we add values in a power-of-two fashion to expand the spectrum of solutions for our design space and to ensure that the area covered by all three accelerators ranges approximately from the same minimum to the same maximum. We will describe the remaining hardware configuration knobs later in this section.

TABLE 3. Hardware configuration knobs explored in the DSE of Sec. VII-B, including maximum tile sizes, HLS directives, and implementation constraints.
Let us now comment on the pseudo-code at the top of Fig. 4. It is a concise version of the high-level C/C++ description that produces the general architecture of the ST-based accelerators using HLS techniques. We first refer to this simplified code to highlight the commonalities between the high-level descriptions of the various accelerators. Then, we provide specific details on how the key parts of this code translate into the high-level C/C++ pseudo-codes of the three accelerators, reported in Lsts. 2, 3, and 4 for 2D-Conv, DW-Conv, and FC, respectively.
After a series of pipelined outermost loops L1-L3, the accelerator reads activations from the internal input buffer (IBUF) and prepares the first operand A for the ST multiplier through the memory addressing and concatenating logic. For 2D-Conv and FC, this operation takes place before the innermost loop L4; however, in the case of DW-Conv, it occurs within L4 because there is no input-channels loop in the DW-Conv algorithm. Then, in L4 the accelerator reads weights from the internal weight buffer (WBUF) and fills the second operand B. Subsequently, it performs the multiplication/dot-product operation using the ST multiplier configured via CONFIG and accumulates the result in the internal output buffer (OBUF). The latter keeps the result of the previous tile iteration stored, or is reset in L2 when the accelerator processes the initial input-weight pair of tiles of a layer execution (see the RESET signal in Lsts. 2-4).
Since L4 is unrolled, the HLS tool synthesizes it by generating the array of M parallel reconfigurable ST-based PSMAC units shown in Fig. 4. To comply with the working principle presented in Sec. V-A, loop L3 needs to terminate earlier in low-precision configurations: this happens when the index of L3 reaches its maximum number of iterations (L3_max) divided by N, where L3_max corresponds to the number of input channels for 2D-Conv, the product of the kernel dimensions for DW-Conv, or the number of input activations for FC, of the current tile execution. This is implemented by the variables ic_lim, k_lim, and ia_lim in Lsts. 2, 3, and 4, respectively. As the number of iterations of loop L3 decreases at reduced precision, the remaining readings from IBUF and WBUF are not performed. Thus, there is no need to fill the unused parts of these buffers with zeros. Finally, only when the accelerator creates the last output tile does OBUF undergo quantization using the corresponding UIQ formula (i.e., Eq. (14) for 2D-Conv, (16) for DW-Conv, and (6) for FC), followed by ReLU (when needed), preparing the output for the computation of the next layer. Otherwise, OBUF keeps accumulating the partial result/output inside the accelerator to avoid data transfers to the external memory, thus following an output-stationary dataflow [42].
Below we delve into the details of each architectural block of the ST-based accelerator illustrated in Fig. 4, highlighting the key differences between the three accelerators.

1) INTERNAL BUFFERS
Part III of Table 2 reports the sizes of the accelerators' internal buffers. These follow the same ordering of the parameters used by the tile sizes in Part II. For IBUF and WBUF of 2D- and DW-Conv we choose the minimum sizes that allow computing 1×1×OC output elements. In particular, we size IBUF of 2D-Conv to store 4 input channels, to allow ST multipliers to operate in all precision configurations. For WBUF we choose the kernel dimensions of 7 and 5, following the weight tile dimensions of Part I. For FC, we size IBUF to store 128 activations and OBUF to store OA output elements, to have a buffer area comparable with that of the other two accelerators. The internal buffers use double buffering to ensure uninterrupted operation of the accelerators while fetching new data from the global buffer. Thus, from the accelerators' point of view, the whole memory hierarchy composed of global buffer and internal buffers behaves as a unified virtual memory that they can access transparently.

LISTING 4. Pseudo-code of our ST-based FC accelerator (inspired by [6]).
The internal input and weight buffers are organized in four 4-bit memory banks, named IBUF_A/B/C/D and WBUF_A/B/C/D, respectively, to enable reading low-precision data according to the memory access patterns shown in Fig. 3. This is visible from the int4 datatype in the function signatures of Lsts. 2-4. The output buffer is organized in 28-bit banks to match the bitwidth of the accumulators in the PSMAC array, as we will see in Sec. V-B3.
To guarantee the proper accelerators' execution, the internal buffers are filled by a Direct Memory Access (DMA) engine following the working principle illustrated in Fig. 3.For 2D-Conv, in configurations 16 × 16 and 16 × 8, one element of the input and weight tiles, once read from the global buffer, is extended to 16-bit (if needed) and split into four 4-bit chunks.Each input and weight chunk is then stored, from the most to the least significant, into IBUF_A-D and WBUF_A-D, respectively.In configurations 8 × 8 and 8 × 4, two input and two weight elements from the channels dimension of the corresponding tiles are extended to 8-bit (if needed) and split into 4-bit chunks.The chunks of the first input and the first weight are stored in IBUF_A-B and WBUF_C-D, respectively; the chunks of the second input and the second weight are stored in IBUF_C-D and WBUF_A-B, respectively.The 4-bit chunks are always stored from most to least significant.In the 4 × 4 case, four input and four weight elements from the channels dimension of the corresponding tiles are all extended to 4-bit (if needed), and then packed in IBUF_A/WBUF_D, IBUF_B/WBUF_C, IBUF_C/WBUF_B, and IBUF_D/WBUF_A, respectively.
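To illustrate the bank organization just described, the sketch below (our own simplified model, not the DMA engine's code) splits input tile elements into 4-bit chunks and shows where they would land in IBUF_A-D for the 16×16 and 8×8 configurations of the 2D-Conv accelerator; weights follow the mirrored placement described above.

```cpp
#include <cstdint>
#include <cstdio>

// Simplified model of how the DMA engine fills the four 4-bit input banks
// (IBUF_A..D) of the 2D-Conv accelerator. Only the input side is shown.
static uint8_t nibble(int32_t v, int i) {       // i-th 4-bit chunk, i = 3 is the MSB chunk
    return (uint8_t)((v >> (4 * i)) & 0xF);
}

int main() {
    // 16x16 (or 16x8) mode: one 16-bit input element is split into four chunks,
    // stored from most to least significant in IBUF_A..D.
    int16_t x0 = -12345;
    uint8_t ibuf16[4] = { nibble(x0, 3), nibble(x0, 2), nibble(x0, 1), nibble(x0, 0) };
    std::printf("16x16: A=%x B=%x C=%x D=%x\n",
                ibuf16[0], ibuf16[1], ibuf16[2], ibuf16[3]);

    // 8x8 (or 8x4) mode: two 8-bit inputs from adjacent channels;
    // the first goes to IBUF_A-B, the second to IBUF_C-D
    // (the corresponding weights go to WBUF_C-D and WBUF_A-B, respectively).
    int8_t c0 = 37, c1 = -5;
    uint8_t ibuf8[4] = { nibble(c0, 1), nibble(c0, 0), nibble(c1, 1), nibble(c1, 0) };
    std::printf("8x8:   A=%x B=%x C=%x D=%x\n",
                ibuf8[0], ibuf8[1], ibuf8[2], ibuf8[3]);
}
```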
For the FC accelerator, the process to fill the memory banks is similar to that of 2D-Conv. However, the number of 4-bit memory banks is twice that of 2D-Conv (see lines 6-13 in Lst. 4). For configuration 4×4, four consecutive input elements from the receptive field of the activation tile and four consecutive weights from the corresponding kernel of the weight tile are extended to 4 bits (if needed) and then stored in IBUF_A and WBUF_D only, leaving the other banks unused.
The data organization discussed above for the three accelerators is important as it enables the partitioning of the internal buffers into smaller memory banks (through the HLS directive interleave).This ensures that each bank contains all the data required by a single PSMAC unit to compute its own channel/activation output elements independently.In this way, the PSMAC array can compute M output channels/activations in parallel, as we show in detail in Sec.V-B3.However, to provide the input operands of ST multipliers in the PSMAC array in one clock cycle for all configurations, as it happens for 2D-Conv and FC, the memory organization of DW-Conv requires that IBUF_B and WBUF_C have two reading ports, and IBUF_A and WBUF_D have four reading ports, whereas all the other banks still have one reading port.As implementing four ports in SRAM ASIC technology would be critical, we decide to use latch-based memories for IBUF_A and WBUF_D.

2) MEMORY ADDRESSING AND CONCATENATING LOGIC
These two logic circuits are designed to implement the working principles outlined in Sec. V-A. Depending on the type of accelerator and the selected configuration (as depicted in Fig. 3), the first is responsible for preparing the addresses to properly access IBUF and WBUF and retrieve the four 4-bit data from each memory bank (e.g., lines 31-35 and 41-45 of Lst. 2 for operands A and B, respectively). The second organizes these data into the 16-bit input operands of the ST multipliers through shift-and-mask operations (e.g., lines 36-37 and 46-47 of Lst. 2). For DW-Conv, these logic circuits are a bit more complex. Indeed, a pair of Look-Up Tables (LUTs) is required to retrieve the proper indexes, pre-computed offline, based on the values of CONFIG and k, where k is the iteration counter of the loop over the kernel (line 25 of Lst. 3). Moreover, as already discussed in Sec. V-A2, DW-Conv requires that two or three 4-bit chunks of A and B are filled with zeros in place of the missing low-precision operands, in the last kernel iteration for N = 2 or N = 4, respectively (lines 44-48 of Lst. 3).

3) RECONFIGURABLE ST-BASED PSMAC ARRAY
The PSMAC array of our ST-based accelerators contains M MAC units, as shown in Fig. 4. Each MAC unit works on a distinct output channel/activation, processing a different filter for 2D-Conv, kernel for DW-Conv, or row of the weight matrix for FC.
The PSMAC array parallelism (M), as listed in Table 3, corresponds to the unrolling factor applied to the innermost loops of the accelerators' high-level code through the HLS directive unroll. Specifically, M is equal to OC for 2D- and DW-Conv, and to OA for FC. This causes the HLS tool to fully unroll the innermost loops (line 39 of Lst. 2, line 27 of Lst. 3, and line 38 of Lst. 4), because the unrolling factor matches their upper bound, thus replicating M times the ST multiplier and the accumulation adder. As introduced in Sec. V-B1, to fully leverage this parallelism, we partition the internal buffers into M memory banks, enabling the PSMAC units to access their required data concurrently. For this purpose, we use the interleave directive with OC (for 2D- and DW-Conv) or OA (for FC) as argument. Table 3 also shows that the partitioning is not required for IBUF of 2D-Conv and FC, since operand A is read outside the innermost unrolled loop (lines 31-37 in Lst. 2, lines 28-36 in Lst. 4).
For 2D- and DW-Conv, each MAC unit consists of one 16-bit ST multiplier (see the function call st_multiplier_function), one 28-bit adder, and one 28-bit accumulation register (P) (lines 49-50 and 50-51 in Lsts. 2 and 3, respectively). For FC, we took inspiration from [6]: each MAC unit comprises two 16-bit ST multipliers (to process two activation/weight pairs in parallel), two 28-bit adders (to sum the outputs of the two multipliers and to accumulate this result, respectively), and one 28-bit accumulation register (P1_plus_P2) (lines 50-53 in Lst. 4). This is also the reason why we have twice the input and weight buffers at the interface (lines 6-13 in Lst. 4).
The bitwidth of adders and accumulation registers is the result of the ablation study discussed in Sec.VI-B.

4) QUANTIZATION AND RELU BLOCK
This block implements the UIQ formulas (14), (16), and (6) (with z_W = 0 and z_b = 0 [13], [14]) for 2D-Conv, DW-Conv, and FC, respectively. For an efficient hardware implementation, we convert the division by the output scaling factor s_Y into a multiplication by its inverse. Additionally, we minimize the bitwidth of the C/C++ variables of the UIQ formulas through the ablation study described in Sec. VI-B. When the accelerator has processed the last pair of input/weight tiles needed to complete a specific output tile, the accumulated results in the PSMAC array are ready to be quantized using the UIQ formulas. In fact, the accumulated results correspond to term (c) in all the UIQ formulas (14), (16), and (6). The remaining variables of the UIQ formulas are passed to the accelerator as inputs because they can be computed offline.
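The conversion of the division by s_Y into a multiplication by its inverse is illustrated by the sketch below. The fixed-point format (here an assumed Q16 reciprocal) and the variable bitwidths in the real accelerators result from the ablation study of Sec. VI-B, so this snippet is only a behavioral reference with hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Behavioral reference for the output requantization step: instead of dividing
// the rescaled accumulator by s_Y, multiply by a precomputed fixed-point
// reciprocal. The Q16 format here is an assumption for illustration only.
int32_t requantize(int32_t acc,          // accumulated term (c), e.g., 28-bit PSMAC result
                   float   s_prod,       // combined input/weight scale, e.g., s_X * s_W
                   float   s_Y,          // output scaling factor
                   int32_t z_Y,          // output zero-point (term (a))
                   int32_t out_min, int32_t out_max) {
    // Offline: fold s_prod / s_Y into a single Q16 fixed-point multiplier.
    const int     FRAC = 16;
    const int64_t mult = (int64_t)std::llround((double)s_prod / s_Y * (1 << FRAC));

    // Online: multiply, shift back (truncating), add the zero-point, then clip
    // to the output range (a real implementation would also round).
    int64_t y = ((int64_t)acc * mult) >> FRAC;
    y += z_Y;
    return (int32_t)std::min<int64_t>(std::max<int64_t>(y, out_min), out_max);
}

int main() {
    // Example: accumulator 1234, s_X*s_W = 0.003, s_Y = 0.05, z_Y = 3, INT8 output.
    std::printf("%d\n", requantize(1234, 0.003f, 0.05f, 3, -128, 127));
}
```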
Furthermore, this block implements layer fusion between UIQ formulas and ReLU as described in Sec.III-A1.Thus, when ReLU is needed, the accelerators can be configured to execute it in hardware.The related pseudo-code, omitted for simplicity, would be at lines 58, 57, and 59 of Lst. 2, Lst. 3, and Lst. 4, respectively.
Finally, all accelerators support per-layer quantization for activations, and per-layer or per-channel quantization for weights, as the latter offers superior performance for DNN quantization, as shown in [14] and [27].

VI. ACCELERATORS DESIGN FLOW
To obtain our ST-based hardware accelerators, we use the design flow outlined in Fig. 5. It consists of the following three steps, which are analyzed in detail in Secs. VI-A to VI-C:
A) MP Quantization and Fine Tuning. Quantizing a set of DNN models in MP is the first step of the proposed flow. For this paper, we choose the MLPerf Tiny benchmark [15] as a case study because its four networks are well-suited for edge devices, which are the main target of our accelerators. Specifically, we quantize activations and weights of its models on 16-, 8-, or 4-bit integers, the same precisions supported by our ST-based accelerators.
B) Minimization of UIQ Variables Bitwidth. The second step is an ablation study aimed at optimizing the hardware accelerators using an iterative approach. In this process, we gradually reduce the bitwidth of the C/C++ fixed-point variables of the UIQ formulas and, for every bitwidth selection, we evaluate the performance of the MP-quantized models obtained from step A) on their test sets. This process ends by reporting the minimum bitwidths for which the models do not exceed a user-defined degradation threshold.
C) Generation of Hardware Accelerators. Using the optimal bitwidth precision determined in step B), we perform a DSE in both latency vs area and latency vs power for each accelerator. In the exploration we vary many hardware configuration knobs, as listed in Table 3, including HLS directives (e.g., pipelining and unrolling) and the type of ST multiplier for the PSMAC array.

A. MP QUANTIZATION AND FINE TUNING
We build the first step of our design flow on top of QKeras [16]. For the bit-width exploration, we use AutoQKeras [16], an extension of QKeras that employs Bayesian Optimization to determine the optimal number of bits for each DNN layer. We constrain weights and activations to INT16, INT8, or INT4, and biases to INT31 or INT16, since it is well known that quantizing biases to lower precisions significantly hurts model performance [13], [27]. We configure AutoQKeras to maximize a score function that is the product of the validation metric of the quantized model (bounded between 0 and 1) and the total bit reduction with respect to a 16-bit flat reference model (i.e., a model with all activations and weights quantized to INT16 and biases quantized to INT31). The total number of bits of a model is the sum of the products between the number of activations/weights of each layer and the number of bits used to represent them. In our case study, we use the validation accuracy as the validation metric for all MLPerf Tiny models except for FC-AutoEncoder, whose output metric is the mean squared error loss between input and output predictions (MSE_loss). To map the +inf-to-0 range of the MSE_loss to the 0-1 range of the other validation metrics (as required by AutoQKeras), we create the following custom validation metric for the AutoEncoder: 1/(1 + MSE_loss/10).
For each network, we use AutoQKeras to iteratively sample from the search space a different combination of feature map, weight, and bias bitwidths for each layer. Then, we let AutoQKeras fine-tune the resulting MP network for a few epochs, starting from pre-trained FP weights, when available, to shorten the bitwidth exploration; otherwise, we let it train the model from scratch. The training is performed using QKeras' quantization-aware training engine. In our case study, since we can rely on the pre-trained weights provided by the MLPerf Tiny repository, we follow the first approach. To further speed up the exploration, we use subsets of the full training and validation sets, together with early stopping. We interrupt AutoQKeras' search when the validation score reaches convergence, i.e., stabilizes around a fixed value. This happens approximately after 200, 400, 100, and 200 search iterations for MobileNetV1Tiny, ResNetV1Tiny, DS-CNN, and FC-AutoEncoder, respectively. Finally, for each model we select the bitwidth combination that gives the best validation score and we conclude by fine-tuning it. In our case study, we use the default settings of the training scripts included in the MLPerf Tiny GitHub repository.
In Sec. V-B, we did not discuss the hardware implementation of BN arithmetic, which we decided not to support in our accelerators to keep the designs lightweight. This is not a limitation, because BN parameters can be efficiently folded offline with the weights of adjacent convolutional layers using a technique known as BN folding [14]. BN folding is a standard procedure for accelerating DNN inference in embedded devices, as BN parameters remain constant after training. To ensure that applying BN folding to the final MP models obtained from AutoQKeras' exploration would not result in folded weights exceeding the supported bitwidths of our accelerators (16, 8, 4 bits), we proactively provide AutoQKeras with FP models whose weights are pre-folded from the beginning of the exploration. Thus, we replace QConv2D+BatchNormalization with QConv2DBatchnorm, and QDepthwiseConv2D+BatchNormalization with QDepthwiseConv2DBatchnorm. At the time of our experiments, QKeras did not support BN-fused FC layers (i.e., QDenseBatchnorm was not yet available). Thus, in our case study we do not apply BN folding to FC-AutoEncoder, as shown in the architecture of the final MP FC-AutoEncoder model (Table 11, Appendix B).
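For reference, per-channel BN folding follows the standard textbook identities below; this is the generic formulation, assuming a convolution y = W * x + b followed by BN with scale \gamma, shift \beta, running mean \mu, variance \sigma^2, and stability constant \varepsilon (the exact notation of [14] may differ):

\hat{W} = \frac{\gamma}{\sqrt{\sigma^2 + \varepsilon}}\, W, \qquad \hat{b} = \frac{\gamma}{\sqrt{\sigma^2 + \varepsilon}}\,(b - \mu) + \beta,

so that BN(W * x + b) = \hat{W} * x + \hat{b} and the BN layer disappears from the inference graph.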
The final MP-quantized MLPerf Tiny models are reported in Appendix B. Their FP and MP performance on the corresponding Perf test sets, using AUC (for FC-AutoEncoder) and accuracy (for the other three models), are provided in columns 4 and 5 of Table 4, respectively. To ensure a solid FP baseline for our comparison with MP models, we re-evaluate the performance of the FP models in our software environment, rather than relying directly on the values reported in [15] (86, 86.5, 92.2, 88.0, for MobileNetV1Tiny, ResNetV1Tiny, DS-CNN, FC-AutoEncoder, respectively). For this task we use the pre-trained weights and test scripts provided by the MLPerf Tiny repository. In Table 4 we also report the total bit reduction of the MP models against their 16-bit flat quantized counterparts (column 7), which are the reference models used by AutoQKeras to guide the optimization of its objective function, as discussed earlier.
The results show that the MP models exhibit approximately a 1% decrease in accuracy compared to their FP counterparts, while still meeting the MLPerf Tiny Quality Targets (column 3), which correspond to the performance that the models should retain after quantization and other optimizations [15]. Moreover, the total bit reduction (column 7) is greater than 50% for all models, confirming the effectiveness of the MP optimization performed by AutoQKeras.

B. MINIMIZATION OF UIQ VARIABLES BITWIDTH
Meeting the hypothetical constraint of zero computational errors in the UIQ formulas would require mathematical operators (i.e., multipliers and adders) with excessively large bitwidths, due to the propagation of the bit precision through the involved mathematical operations. This would result in an impractically large accelerator area, or could even prevent the HLS tool from generating feasible solutions. Therefore, in this second step of the design flow, we perform an ablation study to optimize the hardware accelerators by reducing the bitwidth of the C/C++ variables used in the UIQ formulas.
Let us consider the UIQ formula (6) of FC, with z_W = 0 and z_b = 0, as our reference. The same reasoning holds for the UIQ formulas of the other accelerators. The variables that we consider for the ablation study are listed in the column header of Table 5. The first six are the actual variables shown in the UIQ formula, whereas the last three are the intermediate results v1_{q,k}, v2_{q,k}, v3_{q,k} obtained from the decomposition of (6) into (9)-(12), where Y_{q,k} is the k-th output element, with k ∈ [1, K], quantized on INTy bits (y = 16, 8, or 4) over the corresponding integer quantization range, and all other variables are those introduced alongside (6) in Sec. III-A1. In our accelerators, we implement each of these variables as a fixed-point or as an integer number.
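Purely as an illustration — the actual equations (9)-(12) may differ in detail — a plausible decomposition of (6) under the standard affine UIQ assumptions (z_W = 0, z_b = 0) reads:

v1_{q,k} = \sum_{j} (X_{q,j} - z_X)\, W_{q,j,k}, \qquad v2_{q,k} = v1_{q,k} + b_{q,k},

v3_{q,k} = m \cdot v2_{q,k}, \qquad Y_{q,k} = \mathrm{clip}\!\left(\mathrm{round}(v3_{q,k}) + z_Y\right),

where m = S_X S_W / S_Y is the fixed-point requantization multiplier, z_X and z_Y are the input and output zero-points, and clip(·) saturates to the INTy range.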
Our ablation study aims at optimizing the hardware accelerators using an iterative hardware-software co-design approach. As a preliminary step, we replace the invocations of the low-level TensorFlow routines inside the QKeras QConv2DBatchnorm, QDepthwiseConv2DBatchnorm, and QDense layer classes with invocations of the HLS C/C++ code that describes the corresponding accelerator. Then, we start by performing a statistical analysis of the maximum and minimum values taken by each variable. This analysis involves running inference on the MP-quantized models obtained in the previous step of the flow. The inference is performed on small calibration subsets extracted from the corresponding test sets.

TABLE 4. Performance of MLPerf Tiny models (column 1) on the corresponding Perf test sets (column 2), using AUC for FC-AutoEncoder and accuracy for the other three models, for their FP (column 4), MP (column 5), and MP with optimal C/C++ bitwidths (column 6) versions.

TABLE 5. Minimum bitwidths resulting from the ablation study.
In this way, we determine the least number of bits of the integer part of each fixed-point variable that retains the maximum MP performance (i.e., AUC for FC-AutoEncoder, or accuracy for the remaining MLPerf Tiny models). Afterwards, with these numbers of integer bits as a starting point, we perform a bitwidth exploration of the C/C++ variables of the UIQ formulas (including the intermediate variables v1, v2, and v3): we iteratively decrease the number of fractional and/or integer bits, considering one C/C++ variable at a time, and evaluate the effect on the test metric of the considered models by performing inference on their full test sets (the Perf test sets for the MLPerf Tiny models). We stop the exploration when precision can no longer be reduced without degrading the performance metric of at least one of the analyzed DNNs by more than a given threshold. In our case study, we set a threshold of 0.5% with respect to the MP-quantized test metrics in column 5 of Table 4. In the future, we plan to find these optimal bitwidths automatically through hardware-aware training [53].
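A minimal sketch of the first phase of this analysis is shown below. It only tracks the dynamic range of one UIQ variable (e.g., v3) over a calibration run and derives the number of integer bits needed to avoid overflow, leaving the fractional bits to the subsequent iterative search; the helper names and the hard-coded values are illustrative placeholders, not the actual flow.

#include <cmath>
#include <cstdio>
#include <vector>

// Track the dynamic range of one UIQ variable over a calibration subset and
// derive the minimum number of integer bits of a signed fixed-point type that
// avoids overflow. In the real flow, the observed values come from invoking the
// accelerator's bit-accurate C/C++ model from the modified QKeras layer classes.
struct RangeTracker {
    double min_v = 0.0, max_v = 0.0;
    void update(double v) {
        if (v < min_v) min_v = v;
        if (v > max_v) max_v = v;
    }
    // Integer bits (sign included) of a signed fixed-point covering [min_v, max_v].
    int integer_bits() const {
        double mag = std::fmax(std::fabs(min_v), std::fabs(max_v));
        return static_cast<int>(std::ceil(std::log2(mag + 1.0))) + 1;
    }
};

int main() {
    // Stand-in for the values of v3 observed on a small calibration subset.
    std::vector<double> observed_v3 = {-1034.5, 250.0, 980.75, -12.25};
    RangeTracker tracker;
    for (double v : observed_v3) tracker.update(v);
    std::printf("min=%f max=%f -> integer bits (incl. sign): %d\n",
                tracker.min_v, tracker.max_v, tracker.integer_bits());
    return 0;
}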
The resulting optimal bitwidths for the MLPerf Tiny benchmark are in Table 5, whereas the inference results on the MP-quantized MLPerf Tiny models, obtained by invoking the accelerators in software with these bitwidths, are reported in column 6 of Table 4. From these results we observe that: FC-AutoEncoder has an additional penalty of 0.38% against the FP model; MobileNetV1Tiny and ResNetV1Tiny show no further accuracy loss; for DS-CNN, there is even a slight improvement of 0.4%, a positive side effect of the quantization process that may occasionally occur [14]. We use these optimal values to synthesize the accelerators in the third step of the accelerators design flow (Sec. VI-C).

C. GENERATION OF HARDWARE ACCELERATORS
In the last step of our design flow, we generate the ST-based accelerators using HLS, as shown in the left part of Fig. 4. The procedure consists of two steps. The first performs the actual HLS process by invoking Siemens Catapult (HLS block) with the following three inputs: 1) The top C/C++ high-level description of the ST-based accelerator to generate (C/C++ (top) block), which reflects the pseudo-codes of Lsts. 2-4; 2) The description of the ST multiplier type to use in the PSMAC array (RTL/C/C++ (ST) block): an RTL Intellectual Property (IP) block (IP mode) or an inlined C/C++ function (Inline mode). The distinction between the two modes will be explained later in this subsection. 3) A set of hardware configuration knobs, sampled from Table 3, and a set of HLS constraints and directives, e.g., clock frequency, unrolling, pipelining, partitioning (hardware configuration knobs block). The second step (Implem. block) involves the logic synthesizer, in our case Synopsys Design Compiler (DC), which receives two inputs: 1) The RTL of the accelerator generated by the HLS tool; 2) A set of implementation constraints and usual logic synthesis directives, e.g., clock frequency and clock uncertainty, input/output port delays, driving/load cells, compilation strategy. We use the HLS directives to perform several optimizations. As mentioned in Secs. V-B1 and V-B3, we fully unroll the innermost loops in Lsts. 2-4 with the unroll directive and partition the accelerator's memories into banks with the interleave directive. This combination infers the M parallel MAC units in the PSMAC array and ensures parallel data accesses. For all the other loops, we set the Initiation Interval to 1 to pipeline their execution and increase the accelerator's throughput. When the HLS tool is not able to find a suitable schedule of the operations that satisfies the timing constraint, we remove pipelining from the outer-most loops (more details in Sec. VII-B). The clock frequency constraint is common to both high-level and logic synthesis. However, in the HLS tool we also set an additional constraint: a clock uncertainty of 50% through the Clock Overhead directive, which divides the target clock period in half to account for the next steps of the flow that might increase the delay, such as routing [54]. This technique helps reduce the critical paths in the generated RTL by pushing Catapult to insert additional control steps. As a consequence, the logic synthesizer can achieve the desired timing with smaller logic gates.
Concerning the kind of ST multiplier used in the MAC array (RTL/C/C++ (ST) block), we have two options. The first is to let the HLS tool map the C/C++ function of the ST multiplier in the high-level description (st_multiplier_function) to one of the seven RTL descriptions reported in Table 3. For this we use the directive map_to_operator (e.g., line 3 of Lst. 2), followed by the name of the RTL top-level entity of the multiplier that we want to use, X ∈ {[6], [7], [17], [21], [22], BW-ADD, HLS ST}. In other words, each ST multiplier type is treated as an IP block called Catapult C Optimized Reusable Entity (CCORE) that the tool uses in place of the st_multiplier_function function call. In this case, the ST multiplier code is not synthesized along with the accelerator during the HLS process, but is rather instantiated as a component in the generated accelerator's RTL code. We call this first option IP mode in Table 3. The second option, not explored in [8], is to let the tool inline the C/C++ function of the ST multiplier in the top high-level description of the accelerator, so that it gets synthesized along with the rest of the accelerator. We call this second option Inline mode in Table 3. In this case, we just have to comment out the map_to_operator directive from the accelerator's C/C++ top function. Based on Catapult's documentation [54], implementing a function that is called multiple times as a CCORE (in our case the st_multiplier_function function subject to the unroll directive) is expected to improve design regularity and reduce the shared logic of multiplexers, leading to better area efficiency. However, we also experiment with function inlining because the advantages of using CCOREs are not always guaranteed and are design-dependent. For example, the operators inside the CCORE (e.g., multipliers) will not be available for sharing with any other operator of the same type outside the CCORE's boundaries.
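The C++ fragment below sketches how these choices appear around a PSMAC loop nest. It is not the paper's actual Lsts. 2-4: the loop bounds, the st_multiplier_function prototype, and the directive placement are illustrative assumptions, and the pragma spellings follow common Catapult directive names but may differ across tool versions (the interleave directive on the memories is omitted for brevity).

#include <cstdint>

// Behavioral stand-in for the ST multiplier. In IP mode, a map_to_operator
// directive placed on this function binds the call to one of the RTL CCOREs of
// Table 3; in Inline mode the directive is commented out and this body is
// synthesized together with the rest of the accelerator.
// #pragma map_to_operator <ST_multiplier_RTL_entity>   // IP mode (assumed syntax)
static int32_t st_multiplier_function(int16_t a, int16_t b, uint8_t cfg) {
    (void)cfg;  // cfg would select the 16x16, 2x(8x8), or 4x(4x4) dot-product mode
    return static_cast<int32_t>(a) * static_cast<int32_t>(b);  // behavioral model
}

// PSMAC array sketch: the inner loop is fully unrolled (inferring M parallel MAC
// units) and the outer loop is pipelined with initiation interval 1.
void psmac_array(const int16_t act[64], const int16_t wgt[8][64],
                 int32_t out[8], uint8_t cfg) {
    // #pragma hls_pipeline_init_interval 1   (assumed Catapult directive)
    for (int oc = 0; oc < 8; ++oc) {
        int32_t acc = 0;
        // #pragma hls_unroll yes             (assumed Catapult directive)
        for (int i = 0; i < 64; ++i) {
            acc += st_multiplier_function(act[i], wgt[oc][i], cfg);
        }
        out[oc] = acc;
    }
}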

A. PPA COMPARISON OF ST MULTIPLIERS
To compare all the ST multipliers considered in this paper and identify the best in PPA, we follow the same methodology of our previous work [8], which consists of synthesizing their RTL descriptions using DC, on a 28-nm CMOS technology at 0.9 V, after adding I/O registers.
Fig. 6 reports the area and power vs clock period results obtained by varying the target clock frequency from 0.5 to 1.5 GHz in ten steps. The solutions with the lowest area or power for a given target clock period represent Pareto-optimal points and are connected by a solid black line representing the Pareto front. In both plots, we exclude the right-most outliers to prevent the compression of the left and most significant solutions. Power is calculated using random input bits evenly distributed between zero and one. Although this approach may not faithfully represent realistic ML workloads, it still allows for a valid comparative analysis. In the area vs clock period graph, the Booth design [7] shares the primacy with [6] at 500 MHz (2 ns), then outperforms the other designs from 600 MHz (1.67 ns) to 1400 MHz (0.71 ns), thanks to its low reconfigurability overhead compared to a standard Booth multiplier, as discussed in [7].
The design of [17] is instead Pareto-optimal in area only at 1500 MHz (0.67 ns). The reason lies in the heuristics of the logic synthesizer. Due to the behavioral description of this ST multiplier, the tool has greater freedom in selecting the best implementation for the internal multipliers and adders in terms of area and timing. As the clock constraint tightens, the tool progressively discovers more area-efficient solutions. Conversely, when the constraint is less stringent, the optimization process halts earlier upon finding solutions that satisfy the desired clock period.
Our new BW-ADD is among the best in area in the low-frequency range, being second best from 700 MHz (1.43 ns) to 800 MHz (1.25 ns), closer to the Pareto front than the original BW [6]. Our results confirm that the BW architecture, although very efficient at low frequencies, is not suitable for higher frequencies [41], even with a faster adder, due to the inherently long critical paths of its BW PPM. Solutions based on dedicated multipliers for each configuration (like [17], [22], HLS ST) are inefficient in area because of the redundant logic gates not shared among different operating modes. In other words, their internal multipliers operate in a mutually-exclusive manner based on the specific operating mode. Instead, single high-precision multipliers working in a subword-parallel manner (like [6], BW-ADD and [7]) have a higher utilization ratio of their logic gates, which is reflected in a lower area, especially when the timing constraint is not too strict.
The D&C [21] is the second to last in terms of area, which is most likely due to the shift-and-add logic that connects the low-precision multipliers.
In the power vs clock period graph, all solutions are generally very close. The most relevant results are the following: from 400 to 800 MHz, the optimal ST multipliers are those with a BW architecture (e.g., [6] and BW-ADD); from 1000 to 1300 MHz, [7] progressively dominates over [21] and [22]; at high frequencies, [17] turns out to be the most power efficient.
To sum up this comparison of ST multipliers, the optimal solutions depend on the PPA constraints: [7] offers the best trade-off in area vs clock period for most of the frequencies, [6] and BW-ADD prove to be Pareto-optimal in power at low frequencies, whereas [7] and [17] are the best in both area and power at high frequencies.

B. DSE OF ST-BASED ACCELERATORS
Similarly to what we did in [8], we perform a DSE in area, power and latency on a 28-nm CMOS technology for the three ST-based accelerators. We use the HLS flow described in Sec. VI-C and vary hardware configuration knobs, implementation constraints, accelerators' internal buffers, and maximum tile sizes, according to the values in Table 3. We also vary the target clock frequency (last row of Table 3) in ten steps from 100 to 1000 MHz, which we verified to be the maximum clock frequency reachable by all the accelerators, and the kind of ST multiplier used in the MAC array, for which we have the IP mode or the Inline mode (HLS ST Inline in the keys of Figs. 7-8), as explained in Sec. VI-C.
Despite the suboptimal performance of certain ST multipliers, as indicated by our findings in Sec. VII-A, we still incorporate all types of ST multipliers into the DSE of ST-based accelerators to verify whether the ranking observed at the multiplier level remains consistent at the accelerator level.
As introduced in Sec. VI-C, when the HLS tool fails to meet the target clock frequency, we disable pipelining for some of the outer-most loops of the accelerators. In Table 6 we report the combinations of accelerator type, clock frequency value, OC value, ST multiplier type, and loop name for which we disable pipelining.
TABLE 6. For-loops of the high-level C/C++ descriptions (Lsts. 2-3) for which we disable pipelining in order to allow Catapult HLS to find a schedulable design. We use the loop index as a reference to the loop.

Area and power of the accelerators are measured through DC, with the same methodology of Sec. VII-A. The latency of each accelerator point is determined by multiplying the execution time required by the accelerator to process one tile by the total number of tiles into which a reference DNN layer is divided. Such a reference layer depends on the accelerator type and is represented by the following (input tensor, weight tensor) pair: (16 × 16 × 256, 3 × 3 × 256 × 256) for 2D-Conv; (112 × 112 × 32, 3 × 3 × 32) for DW-Conv; (1024, 1000 × 1024) for FC. The first is the most frequent layer among the selected DNNs for edge devices (Sec. V-B); the second is the first depth-wise layer of MobileNetV1; the third is rather arbitrary, because FC layers vary significantly from one network to another. In any case, by experimenting with many other tensor sizes, we obtain very similar DSE trends as those reported in Figs. 7-8, which can therefore be extended to any DNN layer. Furthermore, we plot the results normalized so as to make them layer-independent.
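To make this latency model concrete, the sketch below computes the number of tiles and the total latency for the 2D-Conv reference layer under a simplified channel-only tiling; the tile sizes and the per-tile cycle count are hypothetical placeholders, since the actual values depend on the knobs of Table 3 and on the measured tile execution time.

#include <cstdio>

// Hypothetical tiling of the 2D-Conv reference layer (16x16x256 input,
// 3x3x256x256 weights). IC_TILE, OC_TILE, and cycles_per_tile are placeholders.
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main() {
    const int IC = 256, OC = 256;          // layer channels
    const int IC_TILE = 32, OC_TILE = 8;   // hypothetical maximum tile sizes
    const long cycles_per_tile = 5000;     // hypothetical per-tile latency (cycles)

    long n_tiles = static_cast<long>(ceil_div(IC, IC_TILE)) * ceil_div(OC, OC_TILE);
    long total_cycles = n_tiles * cycles_per_tile;   // latency = tiles x time-per-tile

    std::printf("tiles = %ld, total latency = %ld cycles\n", n_tiles, total_cycles);
    return 0;
}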
Figs. 7-8 do not report the results of the entire DSE, but only the Pareto-optimal points. To obtain the two figures, we project these points from the three-dimensional PPA space onto two two-dimensional spaces, Latency vs Area (LA) and Latency vs Power (LP), respectively. An illustrative example of a complete DSE for the 2D-Conv accelerator is instead reported in Fig. 9, which shows the extensive range of design variations explored. The points in Fig. 7 connected by the solid line (the Pareto front) and labeled in black are LA-optimal (Pareto-optimal in the LA space), whereas those labeled in red are LP-optimal, that is, they belong to the Pareto front in the Latency vs Power plot of Fig. 8. These labels denote the number of input/output channels for 2D-Conv (m, n = IC, OC), output channels for DW-Conv (m = OC), or input/output activations for FC (m, n = IA, OA) used to generate the corresponding accelerator point, according to the notation introduced in Table 2. In a dual manner, Fig. 8 reports the LA and LP projections on the same Latency vs Power graph: this time the black labels identify the LP-optimal points, whereas the red labels mark the LA-optimal ones.
We observe that the majority of optimal points in the LA space are suboptimal in the LP space, and vice versa. Consider an SoC designer aiming to allocate an area of 0.06 mm² for a 2D-Conv accelerator. The designer might select solution (B) with a (32, 8) input/output channels pair, optimized at 400 MHz, achieving a normalized latency of 0.007 and using IP [7] as ST multiplier. However, for the same latency, the power-optimal choice becomes solution (C), with a (32, 16) input/output channels pair optimized at 100 MHz and using IP [6]. Note that (C) uses 1.6x more area than (B), whereas (B) consumes around 2.5x more power than (C).
There are also a few points that are optimal in both the LA and LP projections: for instance, the DW-Conv accelerator (D), designed for low latency, featuring 32 channels and operating at 400 MHz with HLS ST Inline.
The designer can also optimize the trade-off in the PPA space by choosing solutions that are LA(LP)-optimal and sit close to the LP(LA) Pareto front. For example, solution (E), with a (1024, 4) activations pair, normalized latency 0.23, and IP [7] at 100 MHz, is LA-optimal, but is also very close to the LP Pareto point marked with (F) and using HLS ST Inline, with an 18% power overhead. Conversely, the LP-optimal solution (G), with a (1024, 32) activations pair and IP [21], is also a valid solution in the LA space, with only an 8% area overhead with respect to the nearest LA Pareto point (H), which uses HLS ST as IP.
More in general, from Figs. 7-8 we observe that: • The DSE and the PPA results are especially sensitive to the two main variables that control the PSMAC array parallelism: OC and OA. As such parallelism increases, the number of MAC units and the size of the weight and output memories increase. This leads to an increase of area and power, but more output channels/activations can be computed simultaneously, thanks to the higher number of MAC units, thus reducing latency considerably.
• As expected, very low clock frequencies (≤ 200 MHz) are to be preferred when low area and/or low power are the goals. On the other hand, medium-high clock frequencies are necessary to achieve higher performance.
• The majority of Pareto points for 2D-Conv and FC always have large values of IC and IA, respectively. In fact, increasing these values reduces the overall latency by decreasing the number of tiles N_T and increasing the size of each tile S_T (= IS, WS, or OS, according to Table 2). This is because, even though the product N_T × S_T is constant, as N_T decreases the overall latency decreases in the same proportion, since fewer tiles correspond to fewer times that the accelerator is executed; at the same time, each execution has a latency that increases less than proportionally as S_T increases. This is visible in Lst. 2 and Lst. 4: the latency contribution of the loops that do not depend on IC or IA is amortized by the increased latency of the loops that depend on those variables.
• There is a strong correlation between the Pareto-optimal ST multipliers of Fig. 6 and the types of ST multipliers used in the dominating accelerators in Figs. 7-8. This is especially evident in the LA space, where a large percentage of the accelerators that sit on the Pareto front (75% for 2D-Conv, 93% for DW-Conv, 80% for FC) have a PSMAC array based on the ST multiplier that we proposed in [7], the dominant point in the Clock Period vs Area subplot of Fig. 6. A few Pareto points, however, are based on BW-ADD and HLS ST, which are indeed sub-optimal in the top graph of Fig. 6 but sit close to the Pareto front. This is because sometimes the optimization heuristics of the logic synthesizer manage to obtain slightly better results with those ST multipliers. Notice that in Fig. 7 no Pareto-optimal accelerators are based on the ST multipliers that are largely sub-optimal in the top graph of Fig. 6 ([17], [21], [22]).
Similarly, in the LP space, since in the Clock Period vs Power graph at the bottom of Fig. 6 the ST multipliers are all very close to the Pareto front, the dominant accelerators present a more heterogeneous distribution of ST multiplier types, and the choice of the best IP depends on the designer's actual PPA constraints.
• Accelerators with ST multipliers designed manually in RTL are not always the best choice. In fact, there are accelerators with HLS-based ST multipliers, or fully obtained from a C/C++ description (HLS ST Inline), that belong to the Pareto front. In particular, there are some design points using HLS ST in the LA space, and many more using HLS ST Inline in the LP space. To conclude, the outcomes of the accelerators' DSE do not reveal a single winner, but rather a wide variety of Pareto-optimal solutions, offering SoC designers the flexibility to choose the most suitable implementation aligned with their target, be it low area, low power, or high performance. We will see a practical example in the following subsection.

C. INFERENCE ON MP-QUANTIZED MLPERF TINY MODELS
In this subsection, we showcase the benefits in latency and energy achieved by ST-based accelerators when running inference on the four MLPerf Tiny models, quantized in MP as discussed in Sec. VI-A. This is achieved through a comparative analysis against standard accelerators, which use standard 16-bit multipliers and sign-extend both activations and weights to 16 bits when they are quantized at lower precision.
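As an illustration of what such a standard datapath does, sign-extending an n-bit two's-complement operand to 16 bits can be performed with a shift-left/arithmetic-shift-right pair; the helper below is our own sketch, not code from the accelerators.

#include <cstdint>
#include <cstdio>

// Sign-extend an n-bit two's-complement field (n = 4, 8, or 16) held in the low
// bits of 'raw' to a full int16_t, as a standard 16-bit accelerator would do
// before feeding its fixed-precision multipliers.
static int16_t sign_extend(uint16_t raw, int n_bits) {
    int shift = 16 - n_bits;
    return static_cast<int16_t>(raw << shift) >> shift;  // arithmetic right shift
}

int main() {
    std::printf("%d\n", sign_extend(0xF, 4));   // 4-bit 0b1111 -> -1
    std::printf("%d\n", sign_extend(0x8, 4));   // 4-bit 0b1000 -> -8
    std::printf("%d\n", sign_extend(0x80, 8));  // 8-bit 0x80   -> -128
    return 0;
}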
We carry out this comparison in three different constrained PPA scenarios: low-area, low-power, and low-latency, the latter being defined with a significantly larger area constraint than the first.
For each scenario, from the DSE plots of Figs. 7-8 we select a set of ST-based accelerators to be integrated in a hypothetical SoC with the global buffer and an embedded processor. The set comprises one 2D-Conv, one DW-Conv and one FC accelerator, all having the lowest latency while satisfying the given area or power constraint. The processor orchestrates the sequential execution of each layer of the MP-quantized MLPerf Tiny models, exploiting tensor tiling and transparent memory transfers to/from the external memory thanks to the double buffering mechanism (as explained in Sec. V-B). In particular, for synchronization between the embedded processor and the accelerators, the double buffering mechanism ensures a smooth and synchronized execution of two subsequent tensor tiles. This method uses double buffers, enabling the immediate start of the next tile's execution without delay, as the required data for the subsequent tile is already available thanks to the DMA engine, which is initialized by the processor at the start. In the low-latency scenario, the benefit in energy is less evident and sometimes even unfavourable for ST-based accelerators. This is because the selected ST-based accelerators for this scenario (marked with L in Fig. 7) process many output channels in parallel thanks to the unrolling directive. This implies that part of the reconfiguration logic of ST-based accelerators is replicated, increasing the area and power overhead of ST-based accelerators against standard ones, which do not have the reconfiguration logic. In particular, the ST-based DW-Conv accelerator is the one with the most complex reconfiguration logic of the three ST-based accelerators. Not surprisingly, the two models for which the energy of ST-based accelerators actually increases compared to standard ones are MobileNetV1Tiny and DS-CNN.
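A minimal host-side sketch of the double-buffering scheme mentioned above is shown below. The DMA transfer and the accelerator call are hypothetical stubs (dma_load_tile, run_accelerator), and in the real SoC the transfer of tile i+1 overlaps in time with the computation of tile i, which a simple sequential model like this cannot capture.

#include <cstdint>
#include <vector>

// Hypothetical stubs standing in for the DMA engine and the accelerator call.
static void dma_load_tile(std::vector<int16_t>& buf, int tile_idx) { (void)buf; (void)tile_idx; }
static void run_accelerator(const std::vector<int16_t>& buf) { (void)buf; }

// Ping-pong (double) buffering: while the accelerator processes the tile held in
// one buffer, the DMA fills the other, so the next tile can start without waiting.
void process_layer(int n_tiles, int tile_elems) {
    std::vector<int16_t> buf[2] = {std::vector<int16_t>(tile_elems),
                                   std::vector<int16_t>(tile_elems)};
    dma_load_tile(buf[0], 0);                                  // prefetch the first tile
    for (int t = 0; t < n_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < n_tiles) dma_load_tile(buf[nxt], t + 1);   // overlaps with compute
        run_accelerator(buf[cur]);                             // compute current tile
    }
}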
These results suggest that ST multipliers are well-suited for 2D-Conv and FC, but not for DW-Conv. However, we are already tackling these inefficiencies by developing a new PS DW-Conv accelerator based on another kind of subword-parallel multiplier, which works in a Sum-Apart (SA) mode [6]. This multiplier has the same configurations as the ST multiplier, but does not return the sum of the low-precision multiplications; instead, it keeps them separate, side-by-side, in the multiplier's output. The working principle of the SA-based DW-Conv accelerator would allow multiplying one high-precision element, or two/four low-precision elements, from the input and weight channels in parallel, without summing them together, but keeping the multiplication results separate to adhere to the DW-Conv algorithm.
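The behavioral difference between the two multiplier classes can be summarized with the following sketch, limited to a 2x 8-bit configuration; the input/output packing shown here is illustrative and does not reproduce the actual RTL designs.

#include <cstdint>
#include <cstdio>

// ST (Sum-Together): the two 8-bit products are added inside the multiplier,
// directly producing a dot-product term.
static int32_t st_mul_8x8(int16_t a_packed, int16_t b_packed) {
    int8_t a0 = static_cast<int8_t>(a_packed & 0xFF), a1 = static_cast<int8_t>(a_packed >> 8);
    int8_t b0 = static_cast<int8_t>(b_packed & 0xFF), b1 = static_cast<int8_t>(b_packed >> 8);
    return static_cast<int32_t>(a0) * b0 + static_cast<int32_t>(a1) * b1;
}

// SA (Sum-Apart): the two products are kept separate, side-by-side, in the output,
// which suits DW-Conv, where each channel accumulates independently.
static uint32_t sa_mul_8x8(int16_t a_packed, int16_t b_packed) {
    int8_t a0 = static_cast<int8_t>(a_packed & 0xFF), a1 = static_cast<int8_t>(a_packed >> 8);
    int8_t b0 = static_cast<int8_t>(b_packed & 0xFF), b1 = static_cast<int8_t>(b_packed >> 8);
    int32_t p0 = static_cast<int32_t>(a0) * b0;   // 16-bit product field (illustrative)
    int32_t p1 = static_cast<int32_t>(a1) * b1;
    return (static_cast<uint32_t>(p1) << 16) | (static_cast<uint32_t>(p0) & 0xFFFFu);
}

int main() {
    // Pack a0 = 3, a1 = -2 and b0 = 5, b1 = 7 into 16-bit words (low byte first).
    int16_t a = static_cast<int16_t>((static_cast<uint8_t>(-2) << 8) | 3);
    int16_t b = static_cast<int16_t>((7 << 8) | 5);
    std::printf("ST: %d\n", st_mul_8x8(a, b));    // 3*5 + (-2)*7 = 1
    std::printf("SA: 0x%08x\n", sa_mul_8x8(a, b)); // products kept side-by-side
    return 0;
}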
Finally, we estimate the area overhead of SoCs equipped with ST-based accelerators against SoCs using standard accelerators, for the three scenarios. In addition to the three accelerators (internal buffers included), we include a small processor (i.e., Zero-Riscy [56], cache included) and the global SRAM-based buffer. The results show that SoCs with ST-based accelerators exhibit a limited area overhead of 0.9% in the low-area scenario, 2.5% in the low-power one, and 8.0% in the low-latency one, compared to the standard counterparts.

VIII. CONCLUSION
In this paper, we presented our contribution in the area of DNN accelerators using precision-scalable Sum-Together (ST) multipliers. We started by introducing two new ST multipliers (a Baugh-Wooley multiplier with a modified final adder and one derived from High-Level Synthesis (HLS)), and we made an exhaustive comparison of all the state-of-the-art ST multipliers in terms of power, performance and area (PPA). We then provided detailed insights into the working principles, hardware architectures and design flow of three layer-specific ST-based DNN accelerators for 2D-Convolution, Depth-wise Convolution and Fully-Connected layers, supporting uniform integer quantization. We showcased the Pareto-optimal accelerators resulting from the HLS-driven design space exploration (DSE) in the Latency vs Area and Latency vs Power spaces. The results of this DSE allow designers to select the best type of ST multiplier in conjunction with the best configuration of hardware parameters for a given target in the PPA space. Lastly, we demonstrated the pros and cons of our ST-based accelerators integrated into a System-on-Chip (SoC) with different design requirements: low-area, low-power, and low-latency. We reported the achieved latency speedup and energy reduction on the inference of MP-quantized MLPerf Tiny models, as a case study, and also the area overheads of ST-based accelerators, when comparing against SoCs with equivalent accelerators based on non-ST fixed-precision 16-bit multipliers. In the future, we plan to solve the inefficiencies of the ST-based Depth-wise Convolution accelerator discussed in this paper, by implementing a novel accelerator based on Sum-Apart (SA) multipliers.

APPENDIX A INTEGER-ONLY DNN KERNELS FOR 2D- AND DW-CONV
In this appendix we use the notation of Table 2.

LISTING 1. The C/C++ source code of the HLS ST multiplier.

FIGURE 4. General architecture of the ST-based accelerators (bottom right), HLS flow (bottom left), and pseudo-code of the high-level C/C++ description (top) that produces the general architecture.

TABLE 2. Description of accelerators' parameters related to the tiles (Part I and Part II) and accelerators' internal buffer sizes (Part III). The values of the parameters explored during the DSE of Sec. VII-B, denoted by the DSE entry, are listed in Table 3.

LISTING 3. Pseudo-code of our ST-based DW-Conv accelerator.
4) for a reason clarified in Sec. V-B3. In configurations 16 × 16 and 16 × 8, two consecutive activations from the input tile and two consecutive weights from the same row of the weight tile are read from the global buffer, extended to 16 bits (if needed) and split into four 4-bit chunks. From the most to the least significant, the four chunks of the two inputs are stored into IBUF1_A-D and IBUF2_A-D, while those of the two weights are stored into WBUF1_A-D and WBUF2_A-D, respectively. In configurations 8 × 8 and 8 × 4, four consecutive activations and weights are read along the input array and the same weight matrix row, respectively. Then, they are all extended to 8 bits (if needed) and each is split into two 4-bit chunks. The two chunks of the four inputs are stored in this order: IBUF1_A-B, IBUF1_C-D, IBUF2_A-B, and IBUF2_C-D, whereas those of the four weights are stored in this order: WBUF1_C-D, WBUF1_A-B, WBUF2_C-D, and WBUF2_A-B. 4-bit chunks are always stored from most to least significant. In the 4 × 4 case, eight pairs of consecutive inputs and weights are read, extended to 4 bits (if needed), and stored in the internal memory banks as follows: 1st in IBUF1_A/WBUF1_D, 2nd in IBUF1_B/WBUF1_C, 3rd in IBUF1_C/WBUF1_B, 4th in IBUF1_D/WBUF1_A, 5th in IBUF2_A/WBUF2_D, 6th in IBUF2_B/WBUF2_C, 7th in IBUF2_C/WBUF2_B, 8th in IBUF2_D/WBUF2_A. Filling the memory banks of DW-Conv for configurations 16 × 16 and 16 × 8 follows the same steps of 2D-Conv. However, for low-precision operating modes the filling process is different. For configurations 8 × 8 and 8 × 4, two consecutive input elements from the receptive field of the activation tile and two consecutive weights from the corresponding kernel of the weight tile are extended to 8 bits (if needed) and split into 4-bit chunks. The chunks of the first and second inputs are stored in IBUF_A-B, whereas those of the first and second weights are stored in WBUF_C-D, leaving IBUF_C-D and WBUF_A-B unused.


FIGURE 6. DSE of the SoA and newly proposed ST multipliers.

FIGURE 7. Latency vs Area: results of DSE for 2D-Convolution (top), Depth-wise convolution (middle), and Fully-Connected (bottom) accelerators.Points with black and red labels are Pareto points in Latency vs Area and Latency vs Power, respectively.

FIGURE 8. Latency vs Power: results of DSE for 2D-Convolution (top), Depth-wise convolution (middle), and Fully-Connected (bottom) accelerators.Points with black and red labels are Pareto points in Latency vs Area and Latency vs Power, respectively.

FIGURE 9. Example of a complete DSE for 2D-Conv.

TABLE 9. MP-quantized model of ResNetV1Tiny (using QKeras' syntax and with the new QActivation layer implementing affine uniform quantization). L marks the left branches, R the right ones.

As mentioned in Sec. III-A2, the quantized kernels of 2D- and DW-Conv are derived similarly to FC. Let us start from the non-quantized 2D-Conv:

Y_{oh,ow,oc} = b_{oc} + \sum_{kh=1}^{KH} \sum_{kw=1}^{KW} \sum_{ic=1}^{IC} X_{oh+kh-1,\,ow+kw-1,\,ic} \cdot W_{kh,kw,ic,oc}, \quad \forall\, oh \in [1, OH],\ ow \in [1, OW],\ oc \in [1, OC] \quad (13)

Footnote 1: QKeras is a Keras extension tailored for quantization tasks. It provides drop-in replacements for some layers to transform an FP Keras model into a quantized one. It supports quantization-aware training by implementing fake-quantized layers and a straight-through estimator for back-propagation. Since QKeras supports affine uniform quantization for weights but not for activations, we create a new activation layer class to implement Eqns. (1)-(4), resulting in a new version of QKeras for integer-arithmetic-only inference. This new version behaves similarly to TFLite [13], but, differently from TFLite, it also supports precisions lower than 8 bits for activations and weights. We release this modified version of QKeras on GitHub as open-source code.
Footnote 2: As we show in Appendix B for the four MLPerf Tiny models, we insert the new activation layer (called QActivation) before and after each Conv2D, DepthwiseConv2D, and Dense layer.

TABLE 8. MP-quantized model of MobileNetV1Tiny (using QKeras' syntax and with the new QActivation layer implementing affine uniform quantization).

TABLE 10. MP-quantized model of DS-CNN (using QKeras' syntax and with the new QActivation layer implementing affine uniform quantization).

TABLE 11. MP-quantized model of FC-AutoEncoder (using QKeras' syntax and with the new QActivation layer implementing affine uniform quantization).
X ∈ R^{IH×IW×IC} is the tensor of input activations, W ∈ R^{KH×KW×IC×OC} is the weight tensor, b ∈ R^{OC} is the bias array, and Y ∈ R^{OH×OW×OC} is the output tensor; (IH, IW) and (OH, OW) are the dimensions of the input and output tensors, IC