Introduction
Edge AI offers attractive advantages, including low latency, power efficiency, reduced bandwidth, and enhanced data privacy [1]. However, the high computational complexity of deep neural networks (DNNs) has been a significant obstacle to their deployment on edge devices. To address this issue, researchers have employed optimization techniques such as quantization [2], [3], [4], [5], pruning [6], [7], and highly efficient model design [8], [9], [10], [11]. These methods reduce computational load and memory footprint, enabling the deployment of advanced DNNs on resource-constrained edge platforms.
The widespread adoption of edge AI has led to diverse application requirements, such as low power consumption, minimal latency, high accuracy, and high efficiency. In response, a broad range of accelerators have been proposed, from highly flexible and efficient designs [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] to those operating at ultra-low power consumption [23], [24], [25], [26], [27], [28]. Notably, bit-scalable accelerators [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] provide arithmetic designs that realize variable bit-precision at runtime, natively supporting mixed-precision operations. Since DNNs are known to require different bit-precision levels at different layers [4], [5], such adaptability is crucial for enhancing efficiency.
However, endpoint edge devices present dynamic computational needs that vary both with environmental conditions and over time [29]. Such variability introduces three core challenges in edge environments. 1) Memory budget: Conventional DNN accelerators typically require multiple models to adapt to fluctuating computational demands over time. For instance, surveillance cameras may require different levels of computational resources depending on variables such as the time of day and human activity. Holding multiple models increases the memory footprint, a critical issue for edge AI systems operating under strict memory limitations. 2) Power constraints: Devices at the extreme edge, including surveillance cameras, not only have highly variable computational needs but also face stringent power restrictions, requiring ultra-low power operation in the sub-mW range [30]. 3) Operational flexibility: Given the diversity of applications in versatile edge environments, accelerators must be able to efficiently process a wide range of neural network models. In summary, edge AI systems require adaptive solutions that respond to changing resource demands while maintaining efficiency and minimizing memory and power consumption.
To address these issues, we propose Pianissimo [31], a sub-mW class inference accelerator with progressively adjustable bit-precision. Pianissimo is based on the concept of adjusting the computational complexity according to the inference difficulty: using more computation for complex tasks and less computation for easy tasks. Pianissimo mainly supports the following two features: 1) adaptive model switching, which extracts models of various bitwidth versions from a single model and avoids unnecessary computation for simple tasks; and 2) dynamic processing control, which processes only the regions of interest (ROI) specified by the image sensor. Adaptive model switching is based on the authors’ previously proposed ProgressiveNN [32], which derives representations of various bitwidths from a single model, reducing the computational complexity for simple tasks. Dynamic processing control reduces computational complexity by processing only the ROIs reported by the image sensor.
At the heart of our design is a novel datapath architecture with progressive bit-serial datapaths. Our proposed accelerator is distinct from previous flexible bit-precision accelerators [13], [14], [15], [16], [17], [18], [19] in that it allows low- to high-bitwidth representations with a single weight. By adopting a bitwise quantization representation and a bit-serial accumulation scheme from the most significant bit (MSB) to the least significant bit (LSB), our design ensures high flexibility despite its simple circuit design. Additionally, our bit-serial datapath can be straightforwardly yet efficiently extended to mixed precision. This accumulation scheme allows maximum utilization of the processing element (PE) without reducing its functionality.
To further enhance model-level efficiency, we support well-designed depthwise separable convolutional neural network (DSCNN)–based models [9], [10], [11], [33] with two different datapaths for depthwise (DW) and pointwise (PW) layers. DW and PW layers, introduced in MobileNet [8], brought an innovative leap in lightweight, computationally efficient model design. By dividing the convolution layer into DW for spatial feature extraction and PW for channel-wise feature extraction, MobileNet achieved significant improvements in both computational and parameter efficiency without compromising inference accuracy. The lightweight advantage they offer is therefore crucial for edge AI with limited computation and memory resources. To efficiently process PW and DW layers, we employ two datapath designs to handle their distinct feature extraction dimensions [34]. Moreover, Pianissimo seamlessly handles the transposition of output feature maps to accommodate these varied processing orders. This transposition allows for layer-type-specific input data supply.
Progressive bit-serial DW/PW and block skipping processing are overseen by software-hardware (SW-HW) cooperative control using a reduced instruction set computer (RISC) processor and a HW counter complex; the RISC governs bit-serial PW/DW execution as well as dynamic processing skipping using sensor information. Our work shows that this control scheme, integrating the RISC and the HW counters, significantly increases the flexibility of edge AI inference with only a small power overhead under an ultra-low-power budget.
The remainder of this article is structured as follows. Section II introduces recent bit-level flexible accelerators and ultra-low-power accelerators. Section III presents the two core algorithms of the proposed accelerator, facilitating adaptive inference at the edge. Section IV introduces a sub-mW class inference accelerator called Pianissimo that features a progressive bit-by-bit datapath. Section V shows evaluation results using advanced small NNs. Finally, Section VI concludes this article.
Related Works
A. Bit-serial/decomposed accelerators
Bit-serial computation naturally provides fine-grained flexibility for neural networks. Stripes [13] provided accuracy and performance flexibility by computing a single input operand in a bit-serial manner. UNPU [16] follows a similar approach and improves area efficiency by incorporating lookup-table-based PEs.
Differing from the fully bit-serial computation approach adopted by Stripes and UNPU, Bit Fusion [15] unveiled a fused PE. This innovative design dynamically configures itself based on the bit-precision of the input operands. It achieves power-of-two mixed precision by spatially distributing bits as 2-bit bricks and then merging them appropriately. The bit-serial scheme of Pianissimo stands out by storing all lower bitwidth values within a single weight value and expanding them over time.
There is a growing interest in the utilization of bit-level sparsity for further optimization [14], [17], [18], [19], [20]; leveraging sparsity is highly efficient as it eliminates superfluous zero operations in the datapath [35], [36], [37]. Bit-Pragmatic [14] introduced bit-level sparsity compression for activations using bit-serial computation, significantly improving the computational efficiency of NNs. Bit-Tactical [17] presented value-level weight skipping in addition to bit-level activation sparsity. Bit-Tactical also reduced job latency by allowing data movement to neighboring lanes, addressing the load imbalance caused by fine-grained skipping. Laconic [18] introduced bit-level sparsity for both activations and weights. Bitlet [19] mitigated the load imbalance problem by reducing the need for synchronization, using a bit-interleaving PE that skips irregularly aligned nonzero bits. Ristretto [20] enabled value- and bit-level skipping of both weights and activations by streaming flattened bit-brick sequences with compression of nonzero bricks.
The utilization of bit-level and value-level sparsity suggests further efficiency gains for Pianissimo. However, it is also crucial to consider potential challenges: introducing such fine-grained speedup mechanisms might impose critical circuit area and power overheads, especially when operating under ultra-low-power conditions.
B. Ultra-low power ML accelerator
The domain of ultra-low power machine learning (ML) accelerators presents a complex challenge: executing the desired NN process with minimal energy. Some accelerators have navigated this challenge in the sub-mW/nW range by specializing in tiny models for specific applications [23], [24], [28].
Shan et al. [23] created a keyword spotting chip, including mel-frequency cepstrum coefficient extraction, fabricated in a 28 nm process and operating at a mere 510 nW. The NN core uses a binary DSCNN and a small register-file-based memory block to achieve 94.6 % on a two-word classification task while minimizing data movement and computational cost. Lu et al. [24] proposed a 65 nm chip running at
In recent years, NN accelerators have been proposed that combine ultra-low power consumption and flexibility [25], [26], [27]; Pianissimo is one of these. Jokic et al. [25] propose a face recognition system that combines a CNN with a binary decision tree and reduces power consumption by running the power-hungry CNN core only when necessary. The CNN core supports 1-bit and 16-bit weights. Park et al. [27] achieve high-quality speech enhancement at 740 µW with band optimization that dynamically adjusts computational complexity based on the frequency band. Furthermore, the use of 4-bit logarithmic quantization allows for a PE that operates without multipliers, and its PE supports DW and PW layers. TinyVers [26] is a highly flexible accelerator despite its ultra-low power. TinyVers’ PE array (PEA) supports two datapaths, broadcast and multicast of weights, with a RISC-V processor overseeing the entire process. The chip, fabricated in 22 nm FDX with embedded magnetoresistive random access memory (eMRAM), runs several types of NNs, including ResNet-8 of MLPerf Tiny [33], and achieves high tera operations per second per watt (TOPS/W) with ultra-low power consumption. In contrast to these accelerators, Pianissimo supports the DW layer dataflow, which is critical for edge AI, and provides adaptive bit-precision with a single weight set. A detailed comparison between Pianissimo and these ultra-low power accelerators is presented in Section V-E.
Adaptive Algorithms Behind Pianissimo
Pianissimo was designed for adaptive inference in extreme edge environments. To accomplish this, Pianissimo employs an adaptive adjustment between inference accuracy and computational complexity, as illustrated in FIGURE 1. This adjustment strategy is achieved through two essential model-level algorithms: ProgressiveNN and Block Skip (BS). Both algorithms excel in exploiting information available in edge environments, each offering a unique approach to adaptive computational complexity management. ProgressiveNN allows for model switching based on the difficulty of the input task, as shown at the bottom of FIGURE 1. BS focuses on trimming unnecessary computations outside the ROI, as shown at the top of FIGURE 1. Pianissimo leverages these adaptive processing algorithms to enable adaptive inference and improve efficiency for edge AI applications. This Section details these two algorithms, clarifying the core design concept behind Pianissimo.
Adaptive adjustment of the accuracy-computation tradeoff based on ProgressiveNN and block skip.
A. ProgressiveNN
ProgressiveNN is a flexible bit-precision network that dynamically switches the bitwidth of NN weights according to the task difficulty. ProgressiveNN, proposed by the authors, features 1) a bitwise numeric representation that allows MSB-to-LSB computation and 2) batch normalization (BN) retraining to improve the accuracy of low-bitwidth models.
Whereas general ML accelerators employ fixed-point representation as their main computation scheme, Pianissimo adopts ProgressiveNN’s bitwise binary quantization scheme. Each value is quantized bitwise, with each binary digit representing either +1 or −1. The value can therefore be seen as a set of decomposed binary values.
ProgressiveNN has a nested structure in which the MSB is the outermost of the bitwise decomposed values, as shown in FIGURE 2(a). The main difference from other numerical representations is that this nested representation is processed in order from the outermost MSBs. In more detail, ProgressiveNN obtains the target high-bitwidth values by accumulating from the MSB to the LSB while considering each digit’s place value. We describe this accumulation scheme using the fully-connected layer.
ProgressiveNN algorithm. (a) In ProgressiveNN, each weight digit represents either +1 or −1, and the accumulation from MSB to LSB yields a high bitwidth network. (b) As the bitwidth expands, quantized values asymptotically approach the intended target values.
For the sake of simplicity, we explain this accumulation scheme using a fully-connected layer with C input channels and N-bit weights: \begin{align*} z_{j} &= \sum _{i=1}^{C} w_{i,j} x_{i} + b_{j} \\ &= \sum _{i=1}^{C}\sum _{n=1}^{N} w_{i,j}[n] x_{i} \cdot 2^{n-1} + b_{j}, \tag{1}\end{align*} where $w_{i,j}[n] \in \{-1, +1\}$ denotes the $n$-th binary digit of the weight $w_{i,j}$ and $n = N$ corresponds to the MSB. When only the upper $M$ bits have been accumulated, the partial result is \begin{equation*} z_{j} = \sum _{i=1}^{C}\sum _{n=N-M+1}^{N} w_{i,j}[n] x_{i}\cdot 2^{n-1} + b_{j}. \tag{2}\end{equation*} Thus, only one set of weights is required to obtain models of every bitwidth from 1 to N bits.
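To make (1) and (2) concrete, the following NumPy sketch decomposes real-valued weights into ±1 bit planes and accumulates them MSB-first. The greedy sign-residual decomposition is our own illustrative assumption, not the training procedure of [32], which additionally retrains BN parameters.

```python
import numpy as np

def decompose_progressive(w, n_bits=8):
    """Greedily decompose real-valued weights into {-1, +1} bit planes,
    MSB first, so that w is approximated by sum_n plane_n * 2**(n-1)
    (an illustrative assumption; the actual quantization in [32] may differ)."""
    place = 2.0 ** (n_bits - 1)        # place value of the MSB plane
    residual = np.array(w, dtype=np.float64)
    planes = []                        # planes[0] is the MSB plane
    for _ in range(n_bits):
        plane = np.where(residual >= 0, 1.0, -1.0)
        residual -= plane * place
        planes.append(plane)
        place /= 2
    return planes

def forward_partial(x, planes, n_bits, m_bits, bias=0.0):
    """Eq. (2): accumulate only the top m_bits planes, from MSB to LSB."""
    z = np.full(planes[0].shape[1], bias)
    place = 2.0 ** (n_bits - 1)
    for n in range(m_bits):            # MSB-first partial accumulation
        z += (x @ planes[n]) * place
        place /= 2
    return z

rng = np.random.default_rng(0)
w = rng.normal(scale=32.0, size=(16, 4))    # toy "trained" full-precision weights
x = rng.normal(size=16)
planes = decompose_progressive(w, n_bits=8)
for m in (1, 2, 4, 8):                      # 1-bit to 8-bit models, same weight set
    print(m, "-bit output:", forward_partial(x, planes, n_bits=8, m_bits=m))
```

As the bitwidth M grows, the partial outputs approach the full 8-bit result, mirroring the asymptotic behavior shown in FIGURE 2(b).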
B. Block Skip
BS is an integrated approach crafted to amplify computational efficiency by suppressing non-essential processing outside regions of interest (ROIs) designated by low-power event-driven sensors such as the one proposed in [40]. The idea is to raise the efficiency of applications that seamlessly merge ultra-low-power edge AI with sensor technology, leveraging information harvested directly from endpoint sensors to optimize computational demands.
When the sensor detects motion and identifies the region of interest, a binary mask is supplied to the Pianissimo accelerator, with ROI positions set to one and the rest set to zero. As illustrated in FIGURE 3, an inference procedure is then initiated using only this masked data, thereby excluding irrelevant areas and increasing efficiency. In the figure, the grey blocks of the intermediate feature map indicate operations that are unnecessary; skipping these computations saves power. In processes that reduce the intermediate feature map, such as downsampling, the ROI mask is reduced accordingly so that it remains consistent with the resized data.
BS is particularly effective in vision tasks with limited movement or a narrow scope of activity. One example is security cameras. These devices are often installed in zones where only sporadic or slight motion occurs, and in such scenarios they typically capture small or infrequent actions within a wide field of view. By employing BS in such situations, computational loads can be drastically reduced, realizing significant efficiency improvements.
This dynamic process control is conducted by a RISC processor, which oversees the processing of input segments, decides where to apply BS, and dynamically adjusts processing based on the ROI mask data so that only essential data is processed. Details of the RISC control are described in Section IV-C. The RISC also has dedicated instructions for reducing the BS mask during downsampling.
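A minimal software-level sketch of this behavior follows, assuming the ROI mask is already aligned to the accelerator's output-chunk grid; the function names and the max-pooling-style mask reduction are illustrative assumptions rather than the chip's actual instruction semantics.

```python
import numpy as np

def downsample_mask(mask, stride=2):
    """Reduce the ROI mask alongside a downsampling layer: a block stays
    active if any of its source positions lay inside the ROI (an assumed
    reduction rule resembling max pooling)."""
    h, w = mask.shape
    m = mask[:h - h % stride, :w - w % stride]
    m = m.reshape(h // stride, stride, w // stride, stride)
    return m.max(axis=(1, 3))

def run_layer_with_bs(feature_map, mask, chunk_fn):
    """Process only output chunks whose ROI-mask entry is 1; chunks outside
    the ROI are skipped entirely (their compute is simply never issued)."""
    out = np.zeros_like(feature_map)
    for y in range(mask.shape[0]):
        for x in range(mask.shape[1]):
            if mask[y, x]:                   # inside the ROI
                out[y, x] = chunk_fn(feature_map[y, x])
            # else: skipped -> no MACs and no memory traffic for this chunk
    return out

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:6] = 1                           # ROI reported by the sensor
fm = np.random.default_rng(1).normal(size=(8, 8))
out = run_layer_with_bs(fm, mask, chunk_fn=lambda v: v * 2.0)
print(downsample_mask(mask).astype(int))     # mask for the downsampled layer
```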
Pianissimo
We designed the inference accelerator called Pianissimo to realize ultra-low-power yet flexible inference in extreme edge environments. The key feature is the progressive bit-by-bit datapath, which enables ProgressiveNN inference and ensures that various requirements in edge environments are met. Pianissimo incorporates dynamic model switching and dynamic BS processing via an integrated SW-HW control approach. In addition, Section IV-E describes two use-case-level algorithms built on ProgressiveNN: adaptive precision (AP) and mixed precision (MP).
A. Pianissimo Overall Architecture
FIGURE 4 displays the overall Pianissimo architecture that realizes the adaptive inference shown in FIGURE 1. The architecture mainly consists of five parts: unified memory (UMEM), PEA, post-processing unit (PPU), controller (CTRL), and clock gating unit (CGU).
For the datapath, a three-level memory hierarchy consisting of buffers (ABUF/WBUF), caches (ACM/WCM), and unified memory (UMEM) is used to suppress power consumption by maximizing data reuse and minimizing data movement. As shown at the bottom of FIGURE 4, UMEM is a configurable 7.5 Mb memory space for both weights and activations, in which the sizes of the weight memory (WMEM) and activation memory (AMEM) are 1–4 Mb and 3.5–6.5 Mb, respectively. Such adaptability is crucial to accommodate the diverse on-chip demands of the neural networks (NNs) targeted in this study. For instance, when a model has many weight parameters or a large bitwidth, the proportion of WMEM is increased. Conversely, when activations dominate, the share of AMEM is increased to meet the requirements.
The ACM and WCM adopt direct-mapped caches, exploiting the spatial and temporal locality of NN inference. The 64 Kb ACM/WCM is more energy-efficient for read and write operations than the larger 512 Kb UMEM (from the memory specifications); for the memory macros used in our design, a 64 Kb cache read consumes 38 % less power than a 512 Kb UMEM read. Therefore, utilizing the cache becomes beneficial when data is reused three or more times: three accesses require three cache reads plus one UMEM read, which together consume less power than reading from the UMEM three times. Conversely, considering the power consumed in transferring data from the UMEM to the ACM/WCM, bypassing and deactivating the ACM/WCM saves more power if reuse falls below this threshold.
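As a back-of-the-envelope check of this break-even point, using only the 38 % figure above and counting a single UMEM read to fill the cache (our own simplification): \begin{align*} E_{\text{cache}}(r) = E_{U} + 0.62\,r\,E_{U}, \qquad E_{\text{direct}}(r) = r\,E_{U}, \end{align*} where $E_{U}$ is the energy of one UMEM read and $r$ the number of accesses. The cache path wins when $E_{U} + 0.62\,r\,E_{U} < r\,E_{U}$, i.e., $r > 1/0.38 \approx 2.6$; at $r = 3$ it costs $2.86\,E_{U}$ versus $3\,E_{U}$ for direct UMEM reads, consistent with the example above.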
The controller consists of a customized RISC processor with dedicated instructions for BS and a counter complex for bit-serial multiply-accumulate (MAC) operations. The controller architecture includes both an instruction memory (IMEM) and a data memory (DMEM), each with a capacity of 512 Kb. The IMEM is spacious enough to store the instructions required to execute the NN models, while the DMEM can keep all the BN parameters for multiple bitwidth sets on-chip. The cooperative control of the on-chip RISC processor and the HW counters allows for flexible and fast model switching and ROI processing. The CGU governs the entire core to further improve power efficiency, reducing redundant power consumption.
Two
The PPU processes three tasks: the BN process, the clipped rectified linear unit function, and conversions of the quantization format. The BN operation is executed in 16-bit floating point (FP16) format to enhance inference accuracy, as mentioned in [41]. Accordingly, the accumulation results in 22-bit integer format are converted to FP16 and, after the affine process, converted to an 8-bit block floating-point format whose common 5-bit exponent is supplied by the RISC. The 16 post-processed results are then packed and written to AMEM.
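A rough sketch of this format chain (INT22 accumulations, FP16 BN affine, then 8-bit block floating point with the RISC-supplied shared exponent) is shown below. The rounding, clipping value, and signed-mantissa format are assumptions made for illustration, not the PPU's exact arithmetic.

```python
import numpy as np

def ppu_postprocess(acc_int22, gamma, beta, shared_exp, relu_clip=6.0):
    """Post-process one group of accumulation results (a sketch).
    acc_int22  : int32 array holding 22-bit integer partial sums
    gamma, beta: BN affine parameters, applied in FP16 as stated in the text
    shared_exp : 5-bit exponent chosen by the RISC for the whole block"""
    x = acc_int22.astype(np.float16)                              # INT22 -> FP16
    x = gamma.astype(np.float16) * x + beta.astype(np.float16)    # BN affine
    x = np.clip(x, 0.0, relu_clip)                                # clipped ReLU (clip value assumed)
    # 8-bit block floating point: signed 8-bit mantissas sharing 2**shared_exp
    mant = np.round(x.astype(np.float32) / 2.0 ** shared_exp)
    mant = np.clip(mant, -128, 127).astype(np.int8)
    return mant                                                   # packed and written to AMEM

acc = np.array([1200, -300, 45000, 7], dtype=np.int32)
print(ppu_postprocess(acc, gamma=np.ones(4), beta=np.zeros(4), shared_exp=3))
```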
B. Bit-serial PE and PW/DW Dataflows
As illustrated in FIGURE 5(a), ProgressiveNN is realized with a straightforward bit-serial PE. This PE is primarily composed of a sign-inverter and a shifter.
The core arithmetic unit. (a) ProgressiveNN is realized with sign inversion and shift operations. (b) The register column corresponding to each PEC accumulates the partial sums.
In Pianissimo, the weights, in which each bit digit represents a binary value in {−1, +1}, are fed in a bit-serial manner from the MSB to the LSB. The sign inverter inverts the sign of the input activations according to the corresponding binary weights. Subsequently, the PE column (PEC), with 8 PEs, aggregates the partial sums of the PEs, and the shifter applies the place value of the current weight digit. In the case of
To put it simply, the ProgressiveNN PE is essentially the specialized PE for binary NNs [42], augmented with a shifter. Compared to a fixed-point MAC PE with an 8-bit multiplication and a 22-bit accumulation, a bit-serial MAC PE shows a circuit overhead of around 23 % in simulation using Synopsys Design Compiler. For a fair comparison, the bit-serial PE includes eight 1-bit operations on 8-bit values and their addition instead of an 8-bit multiplication. However, this approach offers the benefit of progressively adjustable bit-precision.
A noticeable distinction lies in the numerical representation. In conventional numerical representations, such as fixed point, the value is calculated from the LSB, which generates carries and increases the number of digits. In contrast, by processing the weights bit-serially from the MSB, the calculation can be interrupted at any bit. This unique approach facilitates the implementation of high-bitwidth weights without compromising the efficiency and utilization of the PE.
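The following behavioral sketch mimics one PEC over the weight-bit cycles described above: eight sign inverters, a column adder, a shifter applying the place value, and an optional early stop after any bit. It is a simplified software model of FIGURE 5, not the RTL; the names and the exact bit ordering are our assumptions.

```python
import numpy as np

def pec_bit_serial_mac(acts8, weight_bits, stop_after=None):
    """acts8       : 8 activations (8-bit range) held in the PEC
    weight_bits : MSB-first list of weight-bit vectors, each of 8 values in {-1, +1}
    stop_after  : optionally stop after this many bits (progressive precision)"""
    n_bits = len(weight_bits)
    acc = 0                                   # models the 22-bit accumulator
    for n, bits in enumerate(weight_bits):
        if stop_after is not None and n >= stop_after:
            break                             # MSB-first, so we may simply stop here
        signed = np.where(np.asarray(bits) > 0, acts8, -acts8)   # 8 sign inverters
        psum = int(signed.sum())              # column adder over the 8 PEs
        acc += psum << (n_bits - 1 - n)       # shifter: place value 2**(N-1-n)
    return acc

acts = np.arange(1, 9)                        # toy 8-bit activations
wbits = [[1, -1, 1, 1, -1, 1, -1, 1]] * 4     # 4-bit weights, MSB first
print(pec_bit_serial_mac(acts, wbits), pec_bit_serial_mac(acts, wbits, stop_after=2))
```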
A PEA consists of 8 PECs (
Pianissimo supports both PW and DW layers, which are essential for inference at the edge. Each of these layers is executed via different datapaths to utilize the PEA computation resources fully. 2D convolution operations (Conv2D) can be sequentially processed as multiple PW operations. FIGURE 6 shows the bit-serial PW and DW dataflows. This figure also highlights a three-level memory hierarchy that supplies weights and activations (center of FIGURE 6).
In the PW mode, each PEA takes as input an 8-bit activation for each of the eight rows and a 1-bit weight for each of the
In DW mode, the PEA handles 8-bit activations for every diagonal and 1-bit weights for each row. The requisite number of activations and weights fluctuates based on the kernel size and stride (FIGURE 6 right). The ABUF and WBUF adjust the input accordingly depending on the size, and the CGU deactivates any superfluous PEs. Unlike in PW mode, the AMEM groups 16 sets of 8-bit activations into a single word, organized in row-major order. As for the weights, they are stored with a specific memory mapping, as illustrated on the right of FIGURE 7(b), to suit the varying DW kernel sizes of 3, 5, and 7. This scheme sequences the weights of the eight grouped kernels in the priority order of row, channel, bit, and column. This structure is optimized to handle multiple kernel sizes and to ensure better allocation of bit-serial weights to these sizes. The grey hatches indicate zero padding. In DW mode, eight kernel rows are utilized simultaneously, so the memory readout design emphasizes row readout efficiency.
Finally, OBUF accumulates the resulting partial sums from both modes, transposes the output direction on demand, and passes them to PPU. The details of the OBUF transpose for a single PEA are illustrated in Fig. 8. In scenarios without the transpose operation, the accumulation results are read vertically from the register column during each cycle, from time t0 to t7, with the specific output determined by a multiplexer. In contrast, when the transpose function is activated, the accumulated result is solely read from register column 0 throughout the output phase. During each cycle in this mode, data is read horizontally from each register column. Each register column then transfers its own data to the adjacent column on the left. The output direction is selected by the transpose flag managed by the RISC processor, resulting in 8 accumulative results from one PEA being sent to the PPU. Therefore, the PPU handles 16 outputs from 2 PEAs in parallel.
Transposition of the output direction in OBUF for a single PEA. 16 resulting outputs from two PEAs are sent to the PPU in parallel.
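Functionally, the transpose flag simply selects whether the 8x8 tile of accumulated results leaves the OBUF column-by-column or row-by-row, so that the AMEM write order matches the next layer's dataflow. The sketch below captures only this functional view and deliberately omits the register-shift mechanism described above.

```python
import numpy as np

def obuf_readout(tile, transpose):
    """tile: 8x8 accumulation results held by one PEA's OBUF.
    Emit 8 results per output cycle, either column-wise (transpose=False)
    or row-wise (transpose=True), as selected by the RISC-managed flag."""
    for t in range(8):
        yield tile[t, :] if transpose else tile[:, t]

tile = np.arange(64).reshape(8, 8)
cols = list(obuf_readout(tile, transpose=False))   # 8 cycles of 8 results each
rows = list(obuf_readout(tile, transpose=True))
assert np.array_equal(np.stack(cols), tile.T) and np.array_equal(np.stack(rows), tile)
```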
C. SW-HW Cooperative Control
FIGURE 9 shows the SW-HW cooperative control by the customized RISC processor and the HW counter complex. The RISC is equipped with 32 registers, each 32-bit wide. Register R00 is the zero-constant register, and R01 is the special purpose register (SPR) for the core kick and RISC sleep flags. Registers R02 to R07 are general-purpose registers. The remaining registers, R08 to R10, R11 to R15, and R16 to R31, are specialized for sublayer control, layer control, and batch normalization parameters, respectively.
The HW counter implements a quadruple nested loop that the RISC processor orchestrates. This loop configuration gives Pianissimo the flexibility to operate in different modes, enabling it to switch between PW and DW modes and to adaptively select the bitwidth for bit-serial weights.
Fig. 10 shows a timing chart of the RISC processor and the HW counter complex. Initially, the RISC sets up the control information and parameters necessary for the next core processing step in the background of the ongoing core processing. It then kicks off the core operation and goes into a deactivated state managed by the CGU. Once kicked off, the nested loops of the HW counter start running, and the PEA follows the loops, performing the processing instructed by the RISC. After all loop processing is completed, the red chunk in Fig. 10 is obtained, and control is returned to the RISC. Once the output chunk is obtained, the RISC repeats the process until the end of the layer. The RISC performs BS by checking the ROI mask and skipping this chunk-generation step for chunks outside the ROI.
Timing chart for the control using the RISC and HW counters. The RISC is gated after the parameter setup and becomes active again after the HW loop is completed.
FIGURE 11 shows the bit-serial PW/DW processing flow with 4-bit weights. The loops with grey hatches are processed in parallel in the PEAs. In PW mode, as shown in FIGURE 11, both the input data and the weights are supplied in channel-major order, with the weights supplied bit by bit. The innermost loop of the PEA covers the output channels in consecutive PW layers and the row direction in a PW-DW layer pair. For example, in consecutive PW layers, PEA0 handles output channels 0–7 and PEA1 handles output channels 8–15. Outside the weight-bit loop is an input row loop of up to 8, in which the weights are reused in the time direction using the WBUF. The outermost loop processes the input sub-channels, completing the inner-product operation.
DW/PW Control flows using 4-bit weights. The gray hatches in loop frameworks indicate parallel processing in PEA.
The DW control flow is also shown in FIGURE 11. The difference from PW appears in the first and third loops from the outside, owing to the difference in kernel shape between the 1×1 PW kernels and the larger DW kernels. In DW mode, the innermost loop of the PEA is the output channel loop. Additionally, the ABUF autonomously manages padding in the width direction, while padding in the height direction is handled through RISC-controlled processing skipping.
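To summarize the control structure, the sketch below mirrors the quadruple nested loop described above for PW mode. The loop bounds, names, and the treatment of the PEA-parallel output-channel dimension as an innermost loop are illustrative assumptions, and DW mode swaps in the kernel-geometry loops as noted.

```python
def pw_loop_nest(n_sub_channels, n_rows, n_weight_bits, out_ch_per_pea=8):
    """Schematic of the HW counter's nested loops in PW mode (illustrative).
    In hardware, the innermost output-channel dimension is unrolled spatially
    across each PEA rather than iterated in time."""
    for sub_ch in range(n_sub_channels):          # outermost: input sub-channel
        for row in range(min(n_rows, 8)):         # input rows; weights reused via WBUF
            for bit in range(n_weight_bits):      # bit-serial weight bits, MSB -> LSB
                for oc in range(out_ch_per_pea):  # e.g. PEA0: ch 0-7, PEA1: ch 8-15
                    yield ("pw", sub_ch, row, bit, oc)

# Example: a 4-bit PW layer slice with 2 input sub-channels and 8 rows.
steps = list(pw_loop_nest(n_sub_channels=2, n_rows=8, n_weight_bits=4))
print(len(steps), steps[0], steps[-1])
```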
D. Power Management
Minimizing power usage is crucial for achieving ultra-low power operation. Pianissimo adopts a fine-grained gating strategy using the CGU, which turns off the clock input to idle registers in the pipeline. The gating is applied to various components, such as the UMEM I/O, WBUF/ABUF, PEA/OBUF, PPU, and the RISC processor, but not to the DMA peripherals or the CTRL units other than the RISC processor. As partially depicted in FIGURE 10, both the RISC processor and the PPU have shorter execution times than the PEA, making them ideal candidates for significant power savings through the CGU. Moreover, within the PEA, idle PEs are actively gated, particularly for DW layers with small kernel sizes, layers with small input feature maps, or layers whose dimensions do not fully occupy a PEA.
UMEM and WCM/ACM incorporate three distinct power management strategies: light sleep (LS), deep sleep (DS), and shutdown (SD). As shown in FIGURE 12, these power-saving modes significantly reduce power consumption compared to the normal operating mode. Specifically, LS, DS, and SD cut leakage power by 31 %, 58 %, and 82 %, respectively. Note that the power consumption in the figure is normalized. These modes are provided as functions of the memory macros.
Leakage power consumption with 512 Kb UMEM’s power management modes. The vertical axis is normalized with reference to the normal mode.
LS is designed for modest but immediate power savings and is dynamically applied throughout the inference process. DS is employed for memory components that are temporarily inactive, offering more substantial power reductions at the cost of longer resumption times. The SD mode is applied to memory spaces that remain unused during the inference process. While UMEM is designed with a comparatively large memory space to accommodate a diverse range of models, its energy efficiency is optimized by applying the SD mode appropriately, keeping the power overhead to a minimum even when executing smaller models with modest memory requirements. In our design, since SD remains unchanged throughout execution, it is managed directly from outside through flags in the DMA control register. The DS and LS flags, on the other hand, are overseen by the RISC for flexible control.
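The selection among these modes can be summarized as the simple policy below; the decision rule and names are our own simplification of the description above, not the chip's exact control logic.

```python
from enum import Enum

class MemPowerMode(Enum):
    NORMAL = "normal"     # actively read or written
    LS = "light sleep"    # ~31% leakage reduction, immediate resumption
    DS = "deep sleep"     # ~58% leakage reduction, longer resumption time
    SD = "shutdown"       # ~82% leakage reduction, for unused memory space

def select_mode(used_by_model, accessed_soon):
    """Illustrative policy: SD is set once (externally, via the DMA control
    register) for memory the loaded model never uses; DS covers temporarily
    inactive banks; LS is the default RISC-managed mode between accesses."""
    if not used_by_model:
        return MemPowerMode.SD
    if not accessed_soon:
        return MemPowerMode.DS
    return MemPowerMode.LS

print(select_mode(used_by_model=True, accessed_soon=False))  # MemPowerMode.DS
```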
E. Adaptive/Mixed Precision using ProgressiveNN
This Section introduces AP and MP, the use-case-level algorithms built on ProgressiveNN that Pianissimo employs to enhance efficiency. Through these strategies, Pianissimo dynamically adjusts the bitwidth of the weights according to varying computational requirements.
AP is a strategy for continuous time-series data, in which the current classification result determines the bitwidth for processing the next data (FIGURE 13(a)). The bitwidth switch is based on the classification confidence level, which indicates how dominant the probability of the inferred class is compared to the probabilities of the other classes [32]. The confidence level is defined via entropy, i.e., the amount of information, which an external processor computes. When the confidence exceeds a threshold, the strategy either narrows the bitwidth or maintains the current bitwidth, depending on the specific requirements of the task. The confidence can depend on both the weight precision and image-specific features.
Efficiency improvement with AP/MP of ProgressiveNN. (a) AP adjusts the bitwidth based on the previous classification confidence (entropy). (b) MP decides bitwidth pairs at the inference time.
MP is another strategy that employs different bitwidths for different layers, based on the fact that different layers require different levels of computational accuracy (FIGURE 13(b)). It is worth noting that the MP implementation of ProgressiveNN differs from traditional methods in that it uses only a single set of weights, so no additional memory cost is required to realize multiple MP weight sets. As a result, Pianissimo can seamlessly continue inference with a single set of weights even if the bitwidth set changes during the inference process. Thus, Pianissimo's basic strategy is to use AP and MP together to maximize efficiency.
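A minimal sketch of how an external host might drive AP and MP is given below. The entropy threshold, the step-wise bitwidth update, the mapping between entropy and "confidence exceeding a threshold," and the per-layer allocation are illustrative assumptions rather than the evaluated policy.

```python
import numpy as np

def confidence_entropy(probs):
    """Entropy of the class probabilities; lower entropy means higher confidence.
    The text states this quantity is computed by an external processor."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def next_bitwidth_ap(probs, current_bits, threshold=1.0, min_bits=1, max_bits=8):
    """Adaptive precision: pick the bitwidth for the *next* frame from the
    current frame's confidence (threshold and step size are illustrative)."""
    if confidence_entropy(probs) < threshold:      # confident enough
        return max(min_bits, current_bits - 1)     # narrow (or keep) the bitwidth
    return min(max_bits, current_bits + 1)         # difficult input: widen

# Mixed precision: per-layer bitwidths drawn from the *same* weight set; only
# the top `bits` planes of each layer's ProgressiveNN weights are accumulated.
mp_config = {"layer0": 8, "layer1": 4, "layer2": 2}  # hypothetical allocation
print(next_bitwidth_ap(np.array([0.9, 0.05, 0.05]), current_bits=4))
```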
Measurement Results
This Section reports the Pianissimo accelerator’s competitive performance within the ultra-low power domain. Furthermore, Pianissimo demonstrates its versatility by facilitating flexible inference across a wide range of NNs [9], [10], [11], [33]. Crucially, for all our evaluations, after the input data and the network model are loaded into the UMEM, the inference is carried out seamlessly until the end, eliminating the need for data transfers to external memory during inference. Also, we employed an 8-bit quantization for activations throughout the evaluations, and a multiply-accumulate operation is counted as two operations.
Section V-A introduces the microphotograph and specifications of the fabricated chip. Section V-B provides a power consumption analysis and highlights that Pianissimo operates in the sub-mW range. Section V-C observes the tradeoffs when applying AP/MP, exploring the potential for adaptive inference at the edge. Section V-D examines the performance impact of using BS and demonstrates the potential for substantial performance gains. Finally, with comprehensive evaluations, Section V-E compares Pianissimo with recent ultra-low power ML accelerators.
A. Chip Implementation and Evaluation Environment
This Section reports the measured results of Pianissimo, fabricated on a 9 mm2 die using TSMC 40 nm CMOS (ULP) technology. FIGURE 14 includes a chip microphotograph and a specification table. The core logic area occupies 4.92 mm2, with memory components occupying 92 % of this space. UMEMs dedicated to AMEM or WMEM are strategically located close to their respective ACM or WCM, and switchable UMEMs are placed near both caches. The core logic is placed between the ACM and WCM to minimize routing delays. Twenty chips were produced, showing only slight variation; the results of one of them are reported in this article.
We verified the Pianissimo behavior using Verilog HDL and the ModelSim simulator, version 2019.4. Pianissimo is implemented in 68,283 lines of Verilog HDL code, partly generated with the Ruby language.
FIGURE 15 shows the evaluation environment for Pianissimo. A Pianissimo chip mounted on a field-programmable gate array (FPGA) mezzanine card (FMC) connects to a ZC702 FPGA board that handles the input/output data. The PC controls the power supply unit and the FMC's clock generator via LAN to sweep the voltage and operating frequency. For evaluation, the PC transfers the test data to Pianissimo through the FPGA, and the resulting outputs are transferred back and verified against the expected values. The recorded power measurements pertain solely to the core and do not account for external memory accesses.
B. Power Consumption Analysis
FIGURE 16 depicts the power consumption trends across operational voltages ranging from 0.7 V to 0.9 V, taking into account varying clock frequencies. The observed power consumption oscillates between
Power consumption vs. frequency with the operational voltages from 0.7 V to 0.9 V. The gray hatch indicates the sub-mW region, and the measured model is 4-bit MobileNetV
FIGURE 17 depicts the per-layer power at the 5 MHz and 20 MHz, 0.7 V operating points of FIGURE 16. The dotted lines represent the average power (see FIGURE 16).
Power consumption in each layer of MobileNetV1 (MLPerf Tiny). The dotted lines indicate the average power (see Fig. 16).
For further analysis of power usage in Pianissimo, FIGURE 18 presents a detailed power consumption breakdown when operating a 4-bit PW layer at 0.9 V and 70 MHz. This breakdown organizes power usage among five primary components: PEA, PPU, UMEM, CTRL, and UMEM CONF. Here, UMEM CONF plays a key role in managing the UMEM configuration, thereby facilitating the configurable memory space. Interestingly, the PEAs emerge as the most power-intensive component, accounting for 79.4 % of total consumption. In contrast, memory's share is only 4.2 %, despite memory typically being a significant power consumer. This efficiency is largely attributed to the data management within the three-level memory hierarchy, even when working with models that inherently offer limited data reuse. Furthermore, the CTRL module, containing the RISC processor and the HW counter complex, consumes less than 3.4 % of the power, ensuring flexible control with minimal overhead.
Power breakdown using 4-bit PW layer. The total power consumption is 5.734 mW at 0.9 V and 70 MHz.
C. Power and Accuracy of AP/MP
We analyzed power consumption in light of dynamic bitwidth variations using AP with MobileNetV
AP/MP evaluation using MobileNetV1 (MLPerf Tiny): (a) Power vs. bitwidth. (b) AP/MP accuracy vs. bitwidth (left), and accuracy vs. TOPS/W (right). The three orange stars indicate the AP accuracy combined with MP.
For a comprehensive analysis, we further investigated the relationship among bitwidth, accuracy, and energy efficiency using the same MobileNetV1 model. The left segment of FIGURE 19(b) indicates that the tradeoff between accuracy and bitwidth is most prominent between 1-bit and 3-bit, nearly saturating beyond 4-bit. The accuracy of AP falls to 72 % at 1-bit. Nevertheless, our findings confirm that the combination of AP and MP considerably improves this tradeoff. As represented by the three distinct orange stars, the combination of MP and AP delivers accuracy levels comparable to the 4-bit to 8-bit models at an average bitwidth of 2 bits. Practical results are also obtained on the ImageNet dataset, as shown in FIGURE 20. It should be noted that these per-layer bitwidth allocations were determined empirically, suggesting potential for further efficiency improvements by using neural architecture search algorithms like those proposed in [10], [43], and [44].
AP accuracy using MobileNetV
The right panel of FIGURE 19(b) shows the relationship between AP/MP accuracy and TOPS/W, where the vertical axis is consistent with the left panel. The combination of MP and AP outperforms AP-only configurations, delivering higher accuracy at similar TOPS/W levels and thus a more favorable tradeoff. The TOPS/W of MP approximately follows the same trajectory as AP with the same average bitwidth. A proportional relationship also exists between the average bitwidth and the execution time.
It should be highlighted that both AP with MP and AP-only use the same weight set. This implies that Pianissimo can handle MP and AP variations without requiring additional weight parameters. The only extra overhead comes from the BN parameters used to improve accuracy, but their memory footprint is significantly smaller than that of the NN weights. In Pianissimo, the DMEM storing the BN parameters has enough space to hold multiple sets of parameters. TABLE 1 shows the DMEM requirements and utilization for the two large evaluation models. From the table, we confirm that four sets of BN parameters (1–4 bit), spanning the accuracy-computation tradeoff, can be held on-chip.
D. Impact of Block Skip
In this section, we report the influence of BS on performance metrics. The performance gain from BS depends on the size of the input feature map and on the model. Therefore, our evaluation targeted a standard Conv2D layer.
FIGURE 21(a) illustrates the variation in energy efficiency and peak performance across different weight bitwidths without BS. The solid lines depict energy efficiency (TOPS/W) mapped on the left vertical axis, whereas the dotted lines indicate peak performance (GOPS) on the right vertical axis. Each color variation corresponds to a different bitwidth. Evaluation results show that energy efficiency spans 0.7 to 1.1 TOPS/W at 4-bit, extending from 1.8 to 3.0 TOPS/W at 1-bit. Maximum efficiencies were consistently achieved at 0.7 V across various bitwidths. Peak performance ranged from 4.6 to 1.2 GOPS for 4-bit and 18.1 to 4.5 GOPS for 1-bit. GOPS and TOPS/W are calculated as follows:
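Assuming the conventional definitions (the source equations are not reproduced here), with one multiply-accumulate counted as two operations as stated at the beginning of this Section: \begin{align*} \text{GOPS} = \frac{2 \times \#\text{MAC}}{t_{\text{exec}} \times 10^{9}}, \qquad \text{TOPS/W} = \frac{\text{GOPS} \times 10^{-3}}{P_{\text{core}}}, \end{align*} where $\#\text{MAC}$ is the number of MAC operations per inference, $t_{\text{exec}}$ the measured execution time, and $P_{\text{core}}$ the measured core power in watts.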
Efficiency improvement of BS in typical Conv2d layer at 20 MHz and 0.7 V: (a) without BS and (b)(c) with BS. Each color consistently represents the bitwidth variation.
FIGURE 21(b) and FIGURE 21(c) show the efficiency improvement achieved using BS, measured at 20 MHz and 0.7 V. The horizontal axis indicates the skip ratio; a higher percentage corresponds to a smaller ROI area within the image. In the scenario without BS, a 1-bit Conv2D registers a performance of only 3.0-1.8 TOPS/W at voltages between 0.7-0.9 V (see FIGURE 21(a)). With the application of BS, this range is boosted to a span between 27.7-10.2 TOPS/W. Notably, at a skip ratio of 0.9, BS provides a significant efficiency improvement, resulting in an enhancement of roughly
E. Overall Evaluations and Comparison
TABLE 2 summarizes the overall results at 20 MHz and 0.7 V for five modern tiny network models: MobileNetV2 [9], MobileNetV1 [33], the Visual Wake Words (VWW) [47] challenge 2019 champion model [10], and two MicroNet variants [11]. Note that MicroNet was evaluated as an 8-bit model, in accordance with the original paper. In addition, we evaluated a classification task using edge images to investigate the further potential of data available in edge environments, considering a possible integration with event vision sensors such as the one proposed in [48]. For this purpose, we created and evaluated the edge VWW dataset using the edge extraction technique described in [49].
These comprehensive results highlight the capability of Pianissimo to provide practical inference speeds (inference/sec) throughout all model variations, including the 1-bit to 8-bit models. Inference/sec is calculated as
TABLE 3 lists the measurement results for the two main evaluation models. As mentioned in Section III-A, the observation that power consumption decreases as bitwidth increases is also confirmed in this model-level analysis. When operating at 20 MHz and 0.7 V, the models run at around 1 mW. When the conditions are raised to 80 MHz and 0.9 V, they still operate in the low-power range, staying below 10 mW. In addition, at these settings, peak performances of 5.148 GOPS and 7.961 GOPS were recorded at 80 MHz. The table suggests that the models deliver competitive performance, even when accounting for the relatively modest level of parallelism. In summary, Pianissimo ensures practical inference capability across a wide range of NNs under ultra-low power conditions.
TABLE 4 compares Pianissimo with recent ultra-low-power inference accelerators [24], [25], [26], [27], [28]. Since the 4-bit weight model offers adequate precision (see FIGURE 19(b)), we use this weight model as our evaluation standard. Pianissimo exclusively supports 8-bit activations to ensure sufficient accuracy. Typically, TOPS/W and GOPS are inversely proportional to the bitwidths of both weights and activations.
While the accelerators proposed in [24] and [28] operate with ultra-low power consumption, they are confined to specific NN models, limiting their applicability across diverse edge environments. Similarly, the CNN core in [25] can handle mixed 1-bit and 16-bit precision, but the minor impact on peak performance suggests that its implementation lacks efficiency. The speech enhancement accelerator presented in [27] supports DSCNNs and optimizes its computational complexity based on the frequency band. However, its range of supported and applicable applications is narrow, and unlike Pianissimo, it lacks a flexible control mechanism, such as a RISC, to optimize power consumption.
TinyVers [26] stands out for its commendable efficiency across several models within the ultra-low power spectrum but presents certain limitations. Particularly, its adaptability leaves room for improvement. Notably, TinyVers does not accommodate the parameter-efficient DSCNNs. Also, a clear gap exists between its performance and theoretical efficiency when downscaling both weights and activations from 8-bit to 2-bit. Instead of achieving the ideal
Conclusion
This article presents a sub-mW class inference accelerator called Pianissimo, supporting progressively adjustable bit-precision. Leveraging a progressive bit-by-bit datapath, Pianissimo achieves adaptive precision ranging from 1-bit to 8-bit. Remarkably, the scalable-precision applications, AP and MP, are realized using a single weight set without reducing PE utilization. Pianissimo also supports BS processing using sensor information, suppressing unnecessary computation outside the ROIs. SW-HW cooperative control enhances the system's flexibility, enabling it to accommodate various adaptive inference approaches. Our results show that Pianissimo achieves 0.49–1.25 TOPS/W at 0.7 V on MobileNetV1. Additionally, Pianissimo demonstrates practical performance across various models while operating at sub-mW class power. Thus, Pianissimo introduces a new dimension of flexibility to ultra-low power applications and shows promise in broadening the scope of use cases that can be efficiently supported. Future work will focus on integrating Pianissimo with actual sensor systems and exploring further flexibility enhancements through sparsity utilization within the low-power paradigm.