
Pianissimo: A Sub-mW Class DNN Accelerator With Progressively Adjustable Bit-Precision




Abstract:

With the widespread adoption of edge AI, the diversity of application requirements and fluctuating computational demands present significant challenges. Conventional accelerators suffer from increased memory footprints due to the need for multiple models to adapt to these varied requirements over time. In such dynamic edge conditions, it is crucial to accommodate these changing computational needs within strict memory and power constraints while maintaining the flexibility to support a wide range of applications. In response to these challenges, this article proposes a sub-mW class inference accelerator called Pianissimo that achieves competitive power efficiency while flexibly adapting to changing edge environment conditions at the architecture level. The heart of the design concept is a novel datapath architecture with a progressive bit-by-bit datapath. This unique datapath is augmented by software-hardware (SW-HW) cooperative control with a reduced instruction set computer processor and HW counters. The integrated SW-HW control enables adaptive inference schemes of adaptive/mixed precision and Block Skip, optimizing the balance between computational efficiency and accuracy. The 40 nm chip, with 1104 KB memory, dissipates 793-1032 µW at 0.7 V on MobileNetV1, achieving 0.49-1.25 TOPS/W at this ultra-low power range.
Published in: IEEE Access ( Volume: 12)
Page(s): 2057 - 2073
Date of Publication: 26 December 2023
Electronic ISSN: 2169-3536



SECTION I.

Introduction

Edge AI offers attractive advantages, including low latency, power efficiency, reduced bandwidth, and enhanced data privacy [1]. However, the high computational complexity of deep neural networks (DNNs) has been a significant obstacle to their deployment on edge devices. To address this issue, researchers have employed optimization techniques such as quantization [2], [3], [4], [5], pruning [6], [7], and highly efficient model design [8], [9], [10], [11]. These methods reduce computational load and memory footprint, enabling the deployment of advanced DNNs on resource-constrained edge platforms.

The widespread adoption of edge AI has led to diverse application requirements, such as low power consumption, minimal latency, high accuracy, and high efficiency. In response, a broad range of accelerators has been proposed, from highly flexible and efficient accelerators [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] to those operating at ultra-low power consumption [23], [24], [25], [26], [27], [28]. Notably, bit-scalable accelerators [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] provide arithmetic designs that realize variable bit-precision at runtime, natively supporting mixed-precision operations. Because DNNs are known to require different bit-precision levels at different layers [4], [5], such adaptability is crucial for enhancing efficiency.

However, endpoint edge devices present dynamic computational needs that vary both with environmental conditions and over time [29]. Such variability introduces three core challenges in edge environments. 1) Memory budget: Conventional DNN accelerators typically necessitate the use of multiple models to adapt to fluctuating computational demands over time. For instance, surveillance cameras may require different levels of computational resources based on variables like time of day and human activity. The need for such multiple models increases the memory footprint, a critical issue for edge AI systems operating under strict memory limitations. 2) Power constraints: Devices at the extreme edge, including surveillance cameras, not only have highly variable computational needs but also face stringent power restrictions, requiring ultra-low-power operation in the sub-mW range [30]. 3) Operational flexibility: Given the diversity of applications in versatile edge environments, devices must be able to efficiently process a wide range of neural network models. In summary, edge AI systems require adaptive solutions that respond to changing resource demands while maintaining efficiency and minimizing memory and power consumption.

To address these issues, we propose Pianissimo [31], a sub-mW class inference accelerator with progressively adjustable bit-precision. Pianissimo is based on the concept of adjusting the computational complexity according to the inference difficulty: using more computation for complex tasks and less computation for easy tasks. Pianissimo mainly supports the following two features: 1) adaptive model switching, which extracts various bitwidth versions from a single model and avoids unnecessary computation on simple tasks; 2) dynamic processing control, which processes only the regions of interest (ROI) specified by the image sensor. Adaptive model switching is based on the authors' proposed ProgressiveNN [32], which extracts high-bitwidth representations from a single model, reducing the computational complexity for simple tasks. Dynamic processing control reduces computational complexity by processing only the ROIs supplied by the image sensor.

At the heart of our design is a novel datapath architecture with progressive bit-serial datapaths. Our proposed accelerator is distinct from previous flexible bit-precision accelerators [13], [14], [15], [16], [17], [18], [19] in that it allows low to high bitwidth representations with a single weight. By adopting a bitwise quantization representation and a bit-serial accumulation scheme from the most significant bit (MSB) to the least significant bit (LSB), our design ensures high flexibility despite the simple circuit design. Additionally, our bit-serial datapath can be straightforwardly yet efficiently extended to mixed precision. This accumulation scheme allows maximum utilization of the processing element (PE) without reducing its functionality.

To further enhance model-level efficiency, we support well-designed depthwise separable convolutional neural network (DSCNN)-based models [9], [10], [11], [33] with two different datapaths for depthwise (DW) and pointwise (PW) layers. DW and PW layers, introduced in MobileNet [8], brought an innovative, lightweight approach to computational efficiency. By dividing the convolution layer into DW for spatial extraction and PW for channel extraction, MobileNet achieved significant improvements in both computational and parameter efficiency without compromising inference accuracy. The lightweight advantage these layers offer is therefore crucial for edge AI with limited computation and memory resources. To efficiently process PW and DW layers, we employ two datapath designs that handle the distinct DW and PW feature extraction dimensions [34]. Moreover, Pianissimo seamlessly handles the transposition of output feature maps to accommodate these varied processing orders. This transposition allows for layer-type-specific input data supply.

Progressive bit-serial DW/PW and block skipping processing are overseen by software-hardware (SW-HW) cooperative control using a reduced instruction set computer (RISC) processor and an HW counter complex; the RISC governs bit-serial PW/DW execution as well as dynamic processing-skipping control using sensor information. Our work shows that the control scheme integrating the RISC and the HW counters significantly increases the flexibility of edge AI inference with only a small power overhead under ultra-low-power operation.

The remainder of this article is structured as follows. Section II introduces the recent bit-level flexible accelerators and ultra-low-power accelerators. Section III presents two core algorithms of the proposed accelerator, facilitating adaptive inference at the edge. Section IV introduces a sub-mW class inference accelerator called Pianissimo that features a progressive bit-by-bit datapath. Section V shows evaluation results using advanced small NNs. Finally, Section VI concludes this article.

SECTION II.

Related Works

A. Bit-Serial/Decomposed Accelerators

Bit-serial computation provides flexibility for neural networks through its fine granularity. Stripes [13] provided accuracy and performance flexibility by computing a single input operand in a bit-serial manner. UNPU [16] follows a similar trajectory and improves area efficiency by incorporating lookup table-based PEs.

Differing from the fully bit-serial computation approach adopted by Stripes and UNPU, Bit Fusion [15] unveiled a fused PE. This innovative design dynamically configures itself based on the bit-precision of the input operands. It achieves power-of-two mixed precision by spatially distributing bits as 2-bit bricks and then merging them appropriately. The bit-serial scheme of Pianissimo stands out by storing all lower bitwidth values within a single weight value and expanding them over time.

There is growing interest in the utilization of bit-level sparsity for further optimization [14], [17], [18], [19], [20]; leveraging sparsity is highly efficient as it eliminates the need for superfluous zero operations in the datapath [35], [36], [37]. Bit-Pragmatic [14] introduced bit-level sparsity compression for the activations using bit-serial computation, significantly improving the computational efficiency of NNs. Bit-Tactical [17] presented value-level weight skipping in addition to bit-level activation sparsity. Bit-Tactical also reduced job latency by allowing data movement to neighboring lanes, addressing the load imbalance caused by fine-grained skipping. Laconic [18] introduced bit-level sparsity for both activations and weights. Bitlet [19] mitigated the load imbalance problem by reducing the need for synchronization with a bit-interleaved PE that skips over irregularly aligned nonzero values. Ristretto [20] enabled value- and bit-level skipping of both weights and activations by streaming flattened bit-brick sequences with compression of nonzero bricks.

The utilization of bit-level and value-level sparsity suggests further efficiency gains for Pianissimo. However, it is also crucial to consider potential challenges: introducing such fine-grained speedup mechanisms might incur critical overhead in circuit area and power consumption, especially when operating under ultra-low-power conditions.

B. Ultra-Low Power ML Accelerators

The domain of ultra-low power machine learning (ML) accelerators presents a complex challenge: executing the desired NN process with minimal energy. Some accelerators have navigated this challenge in sub-mW/nW, specializing in tiny models for specific applications [23], [24], [28].

Shan et al. [23] created a keyword-spotting chip, including mel-frequency cepstrum coefficient extraction, fabricated in a 28 nm process and operating at a mere 510 nW. The NN core uses a binary DSCNN and a small register-file-based memory block to achieve 94.6 % on a two-word classification task while minimizing data movement and computational cost. Lu et al. [24] proposed a 65 nm chip running at 184 µW with a lightweight two-layer edge CNN for hand motion detection and feature extraction. Kosuge et al. [28] achieve speech recognition of 35 keywords at 153 µW in a 40 nm process by fully unrolling the NNs in the circuit and reducing data movement to the limit. In addition, Kosuge et al. [28] employ a model pruned by more than 95 % to significantly reduce the circuit area while improving accuracy by training the activation functions.

In recent years, NN accelerators have been proposed that combine ultra-low power consumption and flexibility [25], [26], [27]; Pianissimo is one of these. Jokic et al. [25] propose a face recognition system that combines a CNN and a binary decision tree and reduces power consumption by running the power-hungry CNN core only when necessary. The CNN core supports 1-bit and 16-bit weights. Park et al. [27] achieve high-quality speech enhancement at 740 µW with band optimization that dynamically adjusts computational complexity based on the frequency band. Furthermore, the use of 4-bit logarithmic quantization allows for a PE that operates without multipliers, and this PE supports DW and PW layers. TinyVers [26] is a highly flexible accelerator despite its ultra-low power. TinyVers' PE array (PEA) supports two datapaths, broadcast and multicast of weights, with a RISC-V processor overseeing the entire process. The chip, fabricated in 22 nm FDX with embedded magnetoresistive random access memory (eMRAM), runs several types of NNs, including ResNet-8 of MLPerf Tiny [33], and achieves high tera operations per second per watt (TOPS/W) with ultra-low power consumption. In contrast, Pianissimo supports the DW layer dataflow, which is critical for edge AI, and provides adaptive bit-precision with a single weight set. A detailed comparison between Pianissimo and these ultra-low power accelerators is presented in Section V-E.

SECTION III.

Adaptive Algorithms Behind Pianissimo

Pianissimo was designed for adaptive inference in extreme edge environments. To accomplish this, Pianissimo employs an adaptive adjustment between inference accuracy and computational complexity, as illustrated in FIGURE 1. This adjustment strategy is achieved through two essential model-level algorithms: ProgressiveNN and Block Skip (BS). Both algorithms excel at dealing with information available in edge environments, each offering a unique approach to adaptive computational complexity management. ProgressiveNN allows for model switching based on the difficulty of the input task, as shown at the bottom of FIGURE 1. BS focuses on trimming unnecessary computations outside the ROI, as shown at the top of FIGURE 1. Pianissimo leverages these adaptive processing algorithms to enable adaptive inference and improve efficiency for edge AI applications. This Section details these two algorithms, clarifying the core design concept behind Pianissimo.

FIGURE 1. Adaptive adjustment of the accuracy-computation tradeoff based on ProgressiveNN and block skip.

A. ProgressiveNN

ProgressiveNN is a flexible bit-precision network that dynamically switches the bitwidth of NN weights according to the task difficulty. ProgressiveNN, proposed by the authors, features 1) a bitwise numeric representation that allows MSB-to-LSB computation and 2) batch normalization (BN) retraining to improve the accuracy of low-bitwidth models.

Whereas general ML accelerators employ fixed-point representation as their main computation scheme, Pianissimo adopts ProgressiveNN’s bitwise binary quantization scheme. Each value is quantized bitwise, with each binary digit representing either +1 or −1. The value can therefore be seen as a set of decomposed binary values.

ProgressiveNN has a nested structure in which the MSB is the outermost of the bitwise decomposed values, as shown in FIGURE 2(a). The main difference from other numerical representations is that this nested representation is processed in order from the outermost MSB. In more detail, ProgressiveNN obtains the target high-bitwidth values by accumulating from the MSB to the LSB while weighting each digit by its place value.

FIGURE 2. ProgressiveNN algorithm. (a) In ProgressiveNN, each weight digit represents either +1 or −1, and the accumulation from MSB to LSB yields a high bitwidth network. (b) As the bitwidth expands, quantized values asymptotically approach the intended target values.

For the sake of simplicity, we explain this accumulation scheme using a fully-connected layer with C input channels. The j-th output z_{j} is described as follows
\begin{align*} z_{j} &= \sum _{i=1}^{C} w_{i,j} x_{i} + b_{j} \\ &= \sum _{i=1}^{C}\sum _{n=1}^{N} w_{i,j}[n]\, x_{i} \cdot 2^{n-1} + b_{j}, \tag{1}\end{align*}
where w_{i,j}[n] \in \{+1, -1\} is the n-th digit of the N-bit weight from the i-th input neuron to the j-th output neuron, x_{i} is the i-th input activation, and b_{j} is the j-th bias. Recalling that the weights are computed from the upper bits, we notice that stopping the computation partway yields a lower-bitwidth value. If only the upper M bits of the N-bit weights are used, (1) becomes
\begin{equation*} z_{j} = \sum _{i=1}^{C}\sum _{n=N-M+1}^{N} w_{i,j}[n]\, x_{i} \cdot 2^{n-1} + b_{j}. \tag{2}\end{equation*}

Thus, only one set of N-bit weights is needed to obtain any desired smaller-bitwidth weight. As shown in the 4-bit example of FIGURE 2(b), increasing the bitwidth used in ProgressiveNN brings the quantized values asymptotically closer to the target values. However, when this bitwise representation is applied directly to NNs, accuracy degrades significantly at low bitwidths because of the changing distribution [38], [39]. To address this problem, ProgressiveNN restores accuracy by retraining the BN for each weight bitwidth set while freezing the weight parameters, as shown in FIGURE 2(c).
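The accumulation of (1) and (2) can be summarized in a short sketch. The following NumPy code is illustrative only: it assumes the weight digits are stored MSB-first as values in {+1, −1}, and all names are hypothetical.

```python
import numpy as np

def progressive_fc(x, w_bits, b, M):
    """Fully-connected layer using only the upper M bits of N-bit ProgressiveNN weights.

    x      : (C,) input activations
    w_bits : (N, C, J) weight digits in {+1, -1}; index 0 holds the MSB
    b      : (J,) biases
    M      : number of digits actually used (1 <= M <= N)
    """
    N = w_bits.shape[0]
    z = b.astype(np.float64)
    for k in range(M):                      # process digits from the MSB downward
        place = 2 ** (N - 1 - k)            # place value 2^(n-1) of this digit
        z = z + place * (w_bits[k].T @ x)   # accumulate the digit's partial sum
    return z
```

Stopping at M < N corresponds exactly to (2): the lower-bitwidth model is obtained from the same weight set, and only the BN parameters differ between bitwidths.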

B. Block Skip

BS is an integrated approach crafted to amplify computational efficiency by suppressing non-essential processing outside the regions of interest (ROIs) designated by low-power event-driven sensors like the one proposed in [40]. The idea is to elevate the efficiency of applications that seamlessly merge ultra-low-power edge AI with sensor technology, leveraging the information harvested directly from endpoint sensors to optimize computational demands.

When the sensor detects motion and identifies the region of interest, a binary mask is supplied to the Pianissimo accelerator, with the ROI positions set to one and the rest to zero. As illustrated in FIGURE 3, an inference procedure is then initiated using only this masked data, thereby excluding irrelevant areas and increasing efficiency. In the figure, the grey blocks of the intermediate feature map indicate operations that are unnecessary; skipping these computations saves power. In processes that reduce the intermediate feature map, such as downsampling, the ROI mask is reduced accordingly so that its size remains consistent with the transformed data.
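As a rough functional sketch, and under the assumption that the mask carries one bit per processing block and is reduced by a logical OR during downsampling (the block size and reduction rule are illustrative, not taken from the chip), BS can be modeled as follows.

```python
import numpy as np

def skip_blocks(feature_map, roi_mask, block_fn):
    """Apply block_fn only to the feature-map blocks whose ROI mask bit is 1.

    feature_map : (H, W, C) input
    roi_mask    : (H // B, W // B) binary mask, one bit per BxB block
    block_fn    : per-block computation (e.g., one convolution tile), shape-preserving
    """
    B = feature_map.shape[0] // roi_mask.shape[0]
    out = np.zeros_like(feature_map)
    for by in range(roi_mask.shape[0]):
        for bx in range(roi_mask.shape[1]):
            if roi_mask[by, bx]:            # blocks outside the ROI are simply skipped
                tile = feature_map[by*B:(by+1)*B, bx*B:(bx+1)*B]
                out[by*B:(by+1)*B, bx*B:(bx+1)*B] = block_fn(tile)
    return out

def downsample_mask(roi_mask, stride=2):
    """Reduce the ROI mask together with the feature map (OR over each stride window)."""
    H, W = roi_mask.shape
    m = roi_mask[:H - H % stride, :W - W % stride]
    m = m.reshape(H // stride, stride, W // stride, stride)
    return m.max(axis=(1, 3))
```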

FIGURE 3. Block skip algorithm using ROI masks obtained from the event-driven sensors.

BS offers significant advantages in vision tasks with limited movement or a narrow scope. One example is the area of security cameras. These devices are often installed in zones where only sporadic or slight motion occurs, and in such scenarios they typically capture only small or infrequent activity within a wide field of view. By employing BS in such situations, the computational load can be drastically reduced, realizing a significant efficiency improvement.

This dynamic process control is conducted by a RISC processor, which oversees the processing of input segments, instructs where BS processing is applied, and dynamically adjusts processing based on the ROI mask data so that only essential data is processed. Details of the RISC control are described in Section IV-C. The RISC also has dedicated instructions for reducing the BS mask during downsampling.

SECTION IV.

Pianissimo

We designed the inference accelerator called Pianissimo to realize ultra-low-power yet flexible inference under extreme edge conditions. The key feature is the progressive bit-by-bit datapath, which enables ProgressiveNN inference and ensures that various requirements in edge environments are met. Pianissimo incorporates dynamic model switching and dynamic BS processing via an integrated SW-HW control approach. In addition, Section IV-E describes two usecase-level algorithms using ProgressiveNN: adaptive precision (AP) and mixed precision (MP).

A. Pianissimo Overall Architecture

FIGURE 4 displays the overall Pianissimo architecture that realizes the adaptive inference shown in FIGURE 1. The architecture mainly consists of five parts: unified memory (UMEM), PEA, post-processing unit (PPU), controller (CTRL), and clock gating unit (CGU).

FIGURE 4. Pianissimo architecture overview and the UMEM configuration.

For the datapath, a three-layer memory hierarchy of buffers (ABUF/WBUF), caches (ACM/WCM), and unified memory (UMEM) is used to suppress power consumption by maximizing data reuse and minimizing data movement. As shown at the bottom of FIGURE 4, UMEM is a configurable 7.5 Mb memory space for both weights and activations, in which the sizes of the weight memory (WMEM) and activation memory (AMEM) are 1-4 Mb and 3.5-6.5 Mb, respectively. Such adaptability is crucial to accommodate this study's diverse neural network (NN) demands on-chip. For instance, when a model has many weight parameters or a wide bitwidth, the proportion of WMEM is increased. Conversely, when the activations dominate, the share of AMEM is increased to meet the various requirements.

The ACM and WCM adopt direct-mapped caches focusing on the spatial and temporal locality in NN inference. The 64 Kb ACM/WCM is more energy-efficient for read and write operations than the larger 512 Kb UMEM (from the memory specifications). Therefore, utilizing the cache becomes particularly beneficial when data is reused more than three times. However, considering the power consumption of transferring data from the UMEM to the ACM/WCM, bypassing and deactivating the ACM/WCM is more power-saving if reuse falls below this threshold. In our design, the 64 Kb memory used as a cache consumes 38 % less power when reading data than the 512 Kb memory used in the UMEM. When data is accessed three times, it requires three cache reads and one UMEM read, which consumes less power than reading from the UMEM three times.
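The break-even point can be checked with a simple normalized calculation. This sketch uses only the 38 % read-power figure quoted above and ignores the cache-write and transfer overheads mentioned in the text, so the exact threshold is illustrative.

```python
UMEM_READ = 1.00    # energy of one UMEM read (normalized)
CACHE_READ = 0.62   # 38 % lower than a UMEM read

def energy_direct(n_accesses):
    """Read the data n times straight from the UMEM."""
    return n_accesses * UMEM_READ

def energy_cached(n_accesses):
    """Fill the cache once from the UMEM, then read it n times."""
    return UMEM_READ + n_accesses * CACHE_READ

# n = 3 -> 3.00 (direct) vs 2.86 (cached): caching starts to pay off around three accesses
for n in range(1, 5):
    print(n, energy_direct(n), round(energy_cached(n), 2))
```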

The controller consists of a customized RISC processor with dedicated instructions for BS and a counter complex for bit-serial multiply-accumulate (MAC) operations. The controller architecture includes both an instruction memory (IMEM) and a data memory (DMEM), each holding a memory capacity of 512 Kb. The IMEM is spacious enough to store the instructions required to execute NN models. Concurrently, the DMEM can keep all the BN parameters for high bitwidth sets on-chip. The cooperative control of the on-chip RISC processor and HW counters allows for flexibility and speed of model switching and ROI processing. The CGU governs the entire core to improve power efficiency further, reducing redundant power consumption.

Two 8×8 processing element arrays (PEA0 and PEA1) work jointly to fill the datapath pipeline according to the convolution mode and perform bit-serial MAC operations. The output buffer (OBUF) accumulates the output partial sums and then sends the accumulated results to the PPU after transposing the output direction as needed. The OBUFs are double-buffered for efficient processing, ensuring a seamless PEA-PPU pipeline.

The PPU performs three tasks: the BN process, the clipped rectified linear unit function, and conversions of the quantization format. The BN operation is executed in 16-bit floating-point (FP16) format to enhance inference accuracy, as mentioned in [41]. Therefore, the accumulation results in 22-bit integer format are converted to FP16 and then, after the affine process, converted to an 8-bit block floating-point format whose common 5-bit exponent is provided by the RISC. The 16 post-processed results are then packed and written to AMEM.
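A functional sketch of this post-processing chain is shown below. The clipping value and the unsigned 8-bit mantissa encoding are assumptions made for illustration; only the int22 → FP16 → BN affine → clipped ReLU → shared-exponent conversion order is taken from the text.

```python
import numpy as np

def ppu(acc, gamma, beta, shared_exp, clip_max=6.0):
    """Post-process integer accumulations into 8-bit block-floating-point activations.

    acc         : (16,) 22-bit integer partial sums from the two PEAs
    gamma, beta : per-channel BN affine parameters (FP16)
    shared_exp  : 5-bit exponent common to the output block, supplied by the RISC
    """
    x = acc.astype(np.float16)                                    # int22 -> FP16
    x = x * gamma.astype(np.float16) + beta.astype(np.float16)    # BN affine in FP16
    x = np.clip(x, np.float16(0), np.float16(clip_max))           # clipped ReLU (clip value assumed)
    mant = np.round(x.astype(np.float32) / 2.0 ** shared_exp)     # scale by the shared exponent
    return np.clip(mant, 0, 255).astype(np.uint8)                 # 8-bit mantissa (assumed unsigned)
```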

B. Bit-Serial PE and PW/DW Dataflows

As illustrated in FIGURE 5(a), ProgressiveNN is realized with a straightforward bit-serial PE. This PE is primarily composed of a sign-inverter and a shifter.

FIGURE 5. The core arithmetic unit. (a) ProgressiveNN is realized with sign inversion and shift operations. (b) The register column corresponding to each PEC accumulates the partial sums.

In Pianissimo, the weights, in which each bit digit represents a binary value in {−1, +1}, are fed in a bit-serial manner from the MSB to the LSB. The sign inverter inverts the sign of the input activations according to the corresponding binary weights. Subsequently, the PE column (PEC), with 8 PEs, aggregates the partial sums of the PEs, and the shifter applies the place value of the weight digit. For N-bit weights, the shift amount is N-1 bits when processing the MSB and 0 for the LSB.

To put it simply, the ProgressiveNN PE is essentially a PE specialized for binary NNs [42], augmented with a shifter. Compared to a fixed-point MAC PE with an 8-bit multiplication and a 22-bit accumulation, the bit-serial MAC PE shows a circuit overhead of around 23 % in simulation using Synopsys Design Compiler. For a fair comparison, the bit-serial PE includes eight 1-bit operations on 8-bit values and their addition instead of an 8-bit multiplication. In exchange, this approach offers progressively adjustable bit-precision.

A noticeable distinction lies in the numerical representation. In general numerical representations, such as fixed point, the value is computed from the LSB, which causes carries and an increase in the number of digits. In contrast, processing the weights bit-serially from the MSB allows the computation to be interrupted at any digit. This unique approach facilitates the implementation of high-bitwidth weights without compromising the efficiency and utilization of the PE.
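One PEC cycle for a single weight digit can be modeled behaviorally as follows; the 8-activation column width and the {+1, −1} digit encoding follow the description above, while the variable names are illustrative.

```python
import numpy as np

def pec_cycle(acts, w_digit, digit_index, total_bits):
    """One processing-element-column cycle for a single weight digit.

    acts        : (8,) 8-bit input activations feeding the column
    w_digit     : (8,) weight digits in {+1, -1}, one per PE in the column
    digit_index : 0 for the MSB, total_bits - 1 for the LSB
    total_bits  : N, the weight bitwidth being executed
    """
    a = np.asarray(acts, dtype=np.int32)
    signed = np.where(np.asarray(w_digit) > 0, a, -a)   # sign inverter in each PE
    col_sum = int(signed.sum())                         # PEC adds the 8 partial sums
    shift = total_bits - 1 - digit_index                # N-1 for the MSB, 0 for the LSB
    return col_sum << shift                             # shifter applies the place value

# The corresponding OBUF register accumulates these shifted sums over digits and channels.
```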

A PEA consists of 8 PECs (8×8 PEs), as shown in FIGURE 5(b), and the two PEAs operate in parallel. The OBUF also has eight register columns; one column consists of 8 registers, and each register column is tasked with accumulating the output of the corresponding PEC. The separate registers in a column accumulate the partial sums for different output channels. The specific register to be used is determined by the scheduled loop of the output channel; the details are described in Section IV-C. Meanwhile, each register column is responsible for accumulating output activations at different positions along the width direction. These register columns are concatenated for the purpose of transposing the output direction.

Pianissimo supports both PW and DW layers, which are essential for inference at the edge. Each of these layers is executed via different datapaths to utilize the PEA computation resources fully. 2D convolution operations (Conv2D) can be sequentially processed as multiple PW operations. FIGURE 6 shows the bit-serial PW and DW dataflows. This figure also highlights a three-level memory hierarchy that supplies weights and activations (center of FIGURE 6).

FIGURE 6. The bit-serial PW/DW dataflows. Weights are supplied bit-by-bit.

In the PW mode, each PEA takes as input an 8-bit activation for each of the eight rows and a 1-bit weight for each of the 8×8 PEs. To provide input data to both PEAs without delay, the WMEM packs 128 sets of 1-bit weights, from 16 channels in each of the eight kernels, into a single word, and the AMEM packs 16 sets of 8-bit activations into a single word in channel-major order, as illustrated on the left of FIGURE 7(a). To avoid unnecessary weight reads during low-bitwidth inference, each memory word stores only the weights of a particular digit (FIGURE 7(c)). Therefore, high-bitwidth weights are read over time with a constant address stride according to the loop framework shown in Section IV-C, and the number of memory reads is proportional to the required bitwidth.
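A sketch of this word packing is shown below, assuming a +1 → 1 / −1 → 0 bit encoding and a little-endian ordering inside the word; both are illustrative choices, as the text only specifies what each word contains.

```python
import numpy as np

def pack_pw_weight_word(w_digit):
    """Pack one bit-plane of PW weights (8 kernels x 16 channels) into a 128-bit word.

    w_digit : (8, 16) array of {+1, -1} digits, all from the same bit position, so that
              low-bitwidth inference only reads the words of the digits it needs.
    """
    bits = (w_digit.reshape(-1) > 0).astype(np.uint8)   # map +1 -> 1, -1 -> 0 (assumed encoding)
    word = 0
    for i, b in enumerate(bits):                        # bit ordering inside the word is assumed
        word |= int(b) << i
    return word

def pack_pw_activation_word(acts16):
    """Pack 16 8-bit activations (channel-major order) into a single 128-bit word."""
    word = 0
    for i, a in enumerate(np.asarray(acts16, dtype=np.uint8)):
        word |= int(a) << (8 * i)
    return word
```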

FIGURE 7. Memory mapping for bit-serial PW and DW weights to maximize data reuse.

In DW mode, the PEA handles an 8-bit activation for every diagonal and 1-bit weights for each row. The requisite number of activations and weights fluctuates based on the kernel size and stride (FIGURE 6, right). The ABUF and WBUF adjust the input accordingly depending on the size, and the CGU deactivates any superfluous PEs to handle this. Unlike in PW mode, the AMEM groups 16 sets of 8-bit activations into a single word, organized in row-major order. As for the weights, they are stored with a specific memory mapping, as illustrated on the right of FIGURE 7(b), to suit the varying DW kernel sizes of 3, 5, and 7. This scheme sequences the weights of the eight grouped kernels in the order of priority: row, channel, bit, and column. This structure is optimized to handle multiple kernel sizes and to ensure better allocation of bit-serial weights to these sizes. The grey hatches indicate zero padding. In DW mode, eight kernel rows are utilized simultaneously, so the memory readout design emphasizes row readout efficiency.

Finally, the OBUF accumulates the resulting partial sums from both modes, transposes the output direction on demand, and passes them to the PPU. The details of the OBUF transpose for a single PEA are illustrated in FIGURE 8. In scenarios without the transpose operation, the accumulation results are read vertically from the register columns during each cycle, from time t0 to t7, with the specific output determined by a multiplexer. In contrast, when the transpose function is activated, the accumulated results are read solely from register column 0 throughout the output phase. During each cycle in this mode, data is read horizontally from each register column, and each register column then transfers its own data to the adjacent column on the left. The output direction is selected by the transpose flag managed by the RISC processor, resulting in 8 accumulated results from one PEA being sent to the PPU. Therefore, the PPU handles 16 outputs from the two PEAs in parallel.
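The two read modes can be mimicked with a toy model of the 8×8 register array; how rows and columns map to registers here is an assumption used only to show that the transposed mode emits the same data in the orthogonal order.

```python
import numpy as np

def read_obuf(regs, transpose):
    """Emit 8 outputs per cycle, over 8 cycles, from one PEA's 8x8 OBUF registers.

    regs      : (8, 8) accumulated results; regs[:, c] models register column c
    transpose : output-direction flag (driven by the RISC in hardware)
    """
    regs = np.array(regs, copy=True)
    outputs = []
    for t in range(8):
        if not transpose:
            outputs.append(regs[t, :].copy())   # one register from every column (mux select = t)
        else:
            outputs.append(regs[:, 0].copy())   # read register column 0 only
            regs[:, :-1] = regs[:, 1:]          # every column shifts toward column 0
    return np.array(outputs)

# read_obuf(m, True) equals read_obuf(m, False).T for any 8x8 matrix m.
```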

FIGURE 8. Transposition of the output direction in OBUF for a single PEA. The 16 resulting outputs from the two PEAs are sent to the PPU in parallel.

C. SW-HW Cooperative Control

FIGURE 9 shows the SW-HW cooperative control by the customized RISC processor and the HW counter complex. The RISC is equipped with 32 registers, each 32 bits wide. Register R00 is the zero-constant register, and R01 is the special-purpose register (SPR) for the core kick and RISC sleep flags. Registers R02 to R07 are general-purpose registers. The remaining registers, R08 to R10, R11 to R15, and R16 to R31, are specialized for sublayer control, layer control, and batch normalization parameters, respectively.

FIGURE 9. The RISC processor and HW counter complex.

The HW counter implements a quadruple nested loop that the RISC processor orchestrates. This loop configuration gives Pianissimo the flexibility to operate in different modes, enabling it to switch between PW and DW modes and to adaptively select the bitwidth for bit-serial weights.

FIGURE 10 shows a timing chart of the RISC processor and the HW counter complex. Initially, the RISC sets up the control information and parameters necessary for subsequent core processing in the background of the ongoing core processing. It then kicks off the core operation and goes into a deactivated state managed by the CGU. Once kicked off, the nested loops of the HW counter start running, and the PEA follows the loops, performing the processing instructed by the RISC. After all loop processing is completed, the red chunk in FIGURE 10 is obtained, and control is returned to the RISC. Once the output chunk is obtained, the RISC repeats the process until the end of the layer. The RISC performs BS by checking the ROI mask and skipping this chunk-generation step.

FIGURE 10. Timing chart for the control using the RISC and HW counters. The RISC is gated after the parameter setup and becomes active after the HW loop is completed.

FIGURE 11 shows the bit-serial PW/DW processing flow with 4-bit weights. The loops with grey hatches are processed in parallel in the PEAs. In PW mode, as shown in FIGURE 11, both the input data and the weights are supplied in channel-major order, with the weights supplied bit-by-bit. The innermost loop of the PEA is over the output channels in consecutive PW layers and over the row direction in the PW-DW layer. For example, in consecutive PW layers, PEA0 handles output channels 0-7 and PEA1 handles output channels 8-15. Outside the weight-bit loop is an input row loop of up to 8, in which the weights are reused in the time direction using the WBUF. The outermost loop processes the input sub-channels, completing the inner product operation.
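Put as pseudocode, the quadruple loop driven by the HW counters for a PW output chunk looks roughly like this; the loop bounds are placeholders, and the innermost loop is unrolled spatially over the two PEAs in hardware rather than executed sequentially.

```python
def pw_chunk_loops(n_sub_channels, n_rows, n_bits, n_out_channels=16):
    """Loop nest for one bit-serial PW output chunk (the RISC sets the bounds, then sleeps)."""
    for sub_ch in range(n_sub_channels):              # outermost: input sub-channel
        for row in range(n_rows):                     # input rows (up to 8); weights reused via WBUF
            for bit in range(n_bits):                 # weight digits, fed from MSB to LSB
                for out_ch in range(n_out_channels):  # spatial in HW: PEA0 -> ch 0-7, PEA1 -> ch 8-15
                    yield sub_ch, row, bit, out_ch
```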

FIGURE 11. DW/PW control flows using 4-bit weights. The gray hatches in the loop frameworks indicate parallel processing in the PEA.

The DW control flow is also shown in FIGURE 11. The difference between PW and DW in terms of kernel geometry appears in the first and third loops from the outside: in DW mode these become the vertical and horizontal loops of the kernel, reflecting the difference in kernel shape between PW and DW. In DW mode, the innermost loop of the PEA is the output channel loop. Additionally, the ABUF autonomously manages padding in the width direction, while padding in the height direction is handled through RISC-controlled processing skipping.

D. Power Management

Minimizing power usage is crucial for achieving ultra-low power operation. Pianissimo adopts a fine-grained gating strategy using the CGU, which turns off the clock input to idle registers in the pipeline. The gating is applied to various components, such as UMEM I/O, WBUF/ABUF, PEA/OBUF, PPU, and the RISC processor, but not to the DMA peripherals or the CTRL units other than the RISC processor. As partially depicted in FIGURE 10, both the RISC processor and the PPU have shorter execution times than the PEA, making them ideal candidates for significant power savings through the CGU. Moreover, within the PEA, idle PEs are actively gated, particularly in cases involving DW layers with small kernel sizes, layers with small input feature maps, or layers whose dimensions do not fully occupy a PEA.

UMEM and WCM/ACM incorporate three distinct power management strategies: light sleep (LS), deep sleep (DS), and shutdown (SD). As shown in FIGURE 12, these power-saving modes significantly reduce power consumption compared to the normal operating mode. Specifically, LS, DS, and SD cut leakage power by 31 %, 58 %, and 82 %, respectively. Note that the power consumption is normalized in the figure. These modes are provided as built-in functions of the memory macros.

FIGURE 12. Leakage power consumption with the 512 Kb UMEM's power management modes. The vertical axis is normalized with reference to the normal mode.

LS is designed for modest but immediate power savings and is dynamically applied throughout the inference process. DS is employed for memory components that are temporarily inactive, offering more substantial power reductions at the cost of longer resumption times. The SD mode is applied to memory spaces that remain unused during the inference process. While UMEM is designed with a comparatively large memory space to accommodate a diverse range of models, its energy efficiency is optimized by using the SD mode appropriately, keeping the power overhead to a minimum even when executing smaller models with small memory requirements. In our design, since SD remains unchanged throughout the execution, it is directly managed externally through flags in the control register for DMA. On the other hand, the DS and LS flags are overseen by the RISC for flexible control.

E. Adaptive/Mixed Precision using ProgressiveNN

This Section introduces AP and MP, the usecase-level algorithms using ProgressiveNN that Pianissimo employs to enhance the efficiency. Through these strategies, Pianissimo dynamically adjusts the bitwidth associated with weights according to the various computational requirements.

AP is the strategy for continuous time-series data, where the current classification result determines the bitwidth for processing the next data (FIGURE 13(a)). The bitwidth switch is based on the classification confidence level, which indicates how dominant the probability of the inferred class is compared to the probabilities of the other classes [32]. The confidence level is defined by the entropy of the output probabilities, which an external processor computes. When the confidence exceeds a threshold, a strategy is taken to either narrow the bitwidth or maintain the current bitwidth, depending on the specific requirements of the task. The confidence can depend on both the weight precision and image-specific features.
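The decision step could look like the following sketch. The thresholds and the widen/narrow policy are illustrative assumptions; only the entropy-based confidence measure and the frame-to-frame bitwidth switching come from the text.

```python
import numpy as np

def next_bitwidth(probs, current_bits, confident_th=0.3, uncertain_th=1.5,
                  min_bits=1, max_bits=8):
    """Choose the weight bitwidth for the next frame from the current classification result.

    probs : softmax class probabilities of the current inference (computed off-chip)
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12))   # low entropy = high confidence
    if entropy < confident_th:
        return max(min_bits, current_bits - 1)         # confident: narrow the bitwidth
    if entropy > uncertain_th:
        return min(max_bits, current_bits + 1)         # uncertain: widen the bitwidth
    return current_bits                                # otherwise keep the current setting
```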

FIGURE 13. Efficiency improvement with AP/MP of ProgressiveNN. (a) AP adjusts the bitwidth based on the previous classification confidence (entropy). (b) MP decides bitwidth pairs at inference time. The three orange stars indicate the AP accuracy combined with MP.

MP is another strategy that employs different bitwidths for different layers, based on the fact that different layers require different levels of computational accuracy (FIGURE 13(b)). It is worth noting that the MP implementation of ProgressiveNN differs from traditional methods in that it uses only a single set of weights. This means that no additional memory cost is required to realize multiple MP weight sets. As a result, Pianissimo can seamlessly continue inference with only a single set of weights even if the bitwidth set changes during the inference process. Thus, Pianissimo's basic strategy is to use AP and MP together to maximize efficiency.

SECTION V.

Measurement Results

This Section reports the Pianissimo accelerator's competitive performance within the ultra-low power domain. Furthermore, Pianissimo demonstrates its versatility by facilitating flexible inference across a wide range of NNs [9], [10], [11], [33]. Crucially, for all our evaluations, after the input data and the network model are loaded into the UMEM, inference is carried out seamlessly until the end, eliminating the need for data transfers to external memory during inference. Also, we employed 8-bit quantization for activations throughout the evaluations, and a multiply-accumulate operation is counted as two operations.

Section V-A introduces the microphotograph and specifications of the fabricated chip. Section V-B provides a power consumption analysis and highlights that Pianissimo operates in the sub-mW range. Section V-C observes the tradeoffs when applying AP/MP, exploring the potential for adaptive inference at the edge. Section V-D examines the performance impact of using BS and demonstrates the potential for substantial performance gains. Finally, with comprehensive evaluations, Section V-E compares Pianissimo with recent ultra-low power ML accelerators.

A. Chip Implementation and Evaluation Environment

This Section reports the measured results of Pianissimo, fabricated on a 9 mm² die using TSMC 40 nm CMOS (ULP) technology. FIGURE 14 includes a chip microphotograph and a specification table. The core logic area occupies 4.92 mm², with memory components occupying 92 % of this space. UMEMs dedicated to AMEM or WMEM are strategically located close to their respective ACM or WCM, and switchable UMEMs are placed near both caches. The core logic is placed between the ACM and WCM to minimize routing delays. Twenty chips were produced, with only slight variation among them; the results of one of them are reported in this article.

FIGURE 14. Pianissimo chip microphotograph and its specifications.

We verified Pianissimo's behavior using Verilog HDL and the ModelSim simulator, version 2019.4. Pianissimo is implemented in 68,283 lines of Verilog HDL code, partly generated with the Ruby language.

FIGURE 15 shows the evaluation environment for Pianissimo. A Pianissimo chip mounted on a field programmable gate array (FPGA) mezzanine card (FMC) connects to a ZC702 FPGA board that handles the input/output data. The PC controls the power supply unit and the FMC's clock generator via LAN to sweep the voltage and operating frequency. For evaluation, the PC transfers the test data to Pianissimo through the FPGA, and the resulting outputs are transferred back and verified against the expected values. The recorded power measurement relates solely to the core and does not account for external memory accesses.

FIGURE 15. Evaluation environment for Pianissimo.

B. Power Consumption Analysis

FIGURE 16 depicts the power consumption trends across operating voltages ranging from 0.7 V to 0.9 V, taking varying clock frequencies into account. The observed power consumption ranges from 275 µW to 5 mW, with the frequency spanning from 5 MHz to 80 MHz. Notably, for frequencies under 10 MHz, power usage remains below 1 mW across all voltage levels. Furthermore, this sub-mW consumption is also attainable at 20 MHz when operating at 0.7 V and 0.75 V. Pianissimo achieves ultra-low-power inference, registering power consumptions of 275 µW at 5 MHz and 793 µW at 20 MHz.

FIGURE 16. Power consumption vs. frequency with operational voltages from 0.7 V to 0.9 V. The gray hatch indicates the sub-mW region, and the measured model is 4-bit MobileNetV1 0.25× (MLPerf Tiny).

FIGURE 17 depicts the per-layer power at 5 MHz and 20 MHz at 0.7 V from FIGURE 16. The dotted lines represent the average power of 275 µW and 793 µW, respectively. Since the odd-numbered layers are DW layers, they typically consume less power than the even-numbered layers. This power reduction is attributed to the small DW kernel size in MobileNetV1, which leads to some PEs being gated when operating in DW mode. The power is also reduced in the latter half of the layers, mainly due to the smaller input feature maps, again with some PEs being gated.

FIGURE 17. Power consumption in each layer of MobileNetV1 (MLPerf Tiny). The dotted lines indicate the average power (see FIGURE 16).

For further power analysis of Pianissimo, FIGURE 18 presents a detailed power consumption breakdown when operating a 4-bit PW layer at 0.9 V and 70 MHz. This breakdown organizes power usage among five primary components: PEA, PPU, UMEM, CTRL, and UMEM CONF. Here, UMEM CONF plays a key role in managing the UMEM configuration, thereby facilitating the configurable memory space. Interestingly, the PEAs emerge as the most power-intensive, accounting for 79.4 % of total consumption. In contrast, the memory's share was only 4.2 %, despite memory typically being a significant power consumer. This efficiency is largely attributed to data management within the three levels of the memory hierarchy, even when working with models that inherently offer limited data reuse. Furthermore, the CTRL module, containing the RISC processor and the HW counter complex, consumes less than 3.4 % of the power, ensuring flexible control with minimal overhead.

FIGURE 18. Power breakdown using a 4-bit PW layer. The total power consumption is 5.734 mW at 0.9 V and 70 MHz.

C. Power and Accuracy of AP/MP

We analyzed power consumption in light of dynamic bitwidth variations using AP with MobileNetV1 0.25× (MLPerf Tiny benchmark [11]). Inference was recorded at a fixed 12 frames per second (FPS) at 0.7 V, with the power consumption during idle states also taken into account. As illustrated in FIGURE 19(a), lower bitwidths resulted in shorter inference execution times at the fixed FPS, subsequently reducing the average power consumption. Importantly, across all bitwidths from 1-bit to 8-bit, Pianissimo consistently maintained sub-mW power levels during inference. Peak power consumption escalated at narrower bitwidths, primarily because the constant power usage of the PPU became more dominant.

FIGURE 19. AP/MP evaluation using MobileNetV1 (MLPerf Tiny): (a) Power vs. bitwidth. (b) AP/MP accuracy vs. bitwidth (left), and accuracy vs. TOPS/W (right). The three orange stars indicate the AP accuracy combined with MP.

For a comprehensive analysis, we further investigated the relationship among bitwidth, accuracy, and energy efficiency using the same MobileNetV1 model. The left segment of FIGURE 19(b) indicates that the tradeoff between accuracy and bitwidth is most prominent between 1-bit and 3-bit, nearly saturating beyond 4-bit. The accuracy of AP falls to 72 % at 1-bit. Nevertheless, our findings confirm that the combination of AP and MP considerably improves this tradeoff. As represented by the three distinct orange stars, the combination of MP and AP delivers accuracy levels comparable to the 4-bit to 8-bit models at an average bitwidth of 2 bits. Practical results are also obtained on the ImageNet dataset, as shown in FIGURE 20. It should be noted that these per-layer bitwidth allocations were determined empirically, suggesting potential for further efficiency improvements by using neural architecture search algorithms like those proposed in [10], [43], and [44].

FIGURE 20. AP accuracy using MobileNetV2 1.0× on the ImageNet dataset. Note that this assessment solely focuses on SW evaluation.

The right part of FIGURE 19(b) shows the relationship between AP/MP accuracy and TOPS/W, where the vertical axis is shared with the left figure. The combination of MP and AP outperforms AP-only configurations in delivering higher accuracy at similar TOPS/W levels, indicating a more favorable tradeoff. The TOPS/W of an MP configuration approximately follows the same trajectory as an AP configuration with the same average bitwidth. A proportional relationship also exists between the average bitwidth and execution time.

It should be highlighted that both AP with MP and AP-only make use of the same weight set. This implies that Pianissimo can handle MP and AP variations without requiring additional weight parameters. The only extra overhead comes from the BN parameters needed to improve accuracy, but their memory footprint is significantly smaller than that of the NN weights. In Pianissimo, the DMEM storing the BN parameters has enough memory space to hold multiple sets of parameters. TABLE 1 shows the DMEM requirements and utilization for the two large evaluation models. From the table, we can confirm that four sets of BN parameters, covering 1-bit through 4-bit and their accuracy-computation tradeoff, can be held on-chip.

TABLE 1. DMEM requirement for multiple sets of the BN parameters of the two large models in our evaluation.

D. Impact of Block Skip

In this section, we report the influence of BS on performance metrics. The performance gain derived from BS depends on the size of the input feature map and the model. Therefore, our evaluation targeted standard 3×3 2D convolutional layers (Conv2D) with 32 input and output channels. For the ROI image, we used bounding boxes with motion-related labels, such as creatures and vehicles, from the Microsoft COCO dataset [46] to define the ROI region.

FIGURE 21(a) illustrates the variation in energy efficiency and peak performance across different weight bitwidths without BS. The solid lines depict energy efficiency (TOPS/W) mapped on the left vertical axis, whereas the dotted lines indicate peak performance (GOPS) on the right vertical axis. Each color corresponds to a different bitwidth. The evaluation results show that energy efficiency spans 0.7 to 1.1 TOPS/W at 4-bit, extending from 1.8 to 3.0 TOPS/W at 1-bit. Maximum efficiencies were consistently achieved at 0.7 V across the bitwidths. Peak performance ranged from 4.6 to 1.2 GOPS for 4-bit and from 18.1 to 4.5 GOPS for 1-bit. GOPS and TOPS/W are calculated as GOPS = (MACs × 2 × frequency / cycles) × 10⁻⁹ and TOPS/W = (MACs × 2 × frequency / (cycles × power)) × 10⁻¹², where MACs is the number of multiply-accumulate operations per inference (each counted as two operations) and cycles is the number of cycles per inference obtained from ModelSim simulation. For instance, the TOPS/W of the 1-bit Conv2D layer at 20 MHz and 0.7 V is calculated as (44 × 10⁶ × 2 × 20 × 10⁶) / (391 × 10³ × 0.0015) × 10⁻¹² ≃ 3.0.
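The same arithmetic in code, reproducing the 1-bit Conv2D example above (44 M MACs, 20 MHz, 391 k cycles, 1.5 mW); the figures are taken directly from the text.

```python
def gops(macs, frequency_hz, cycles):
    """Peak performance in GOPS; each MAC counts as two operations."""
    return macs * 2 * frequency_hz / cycles * 1e-9

def tops_per_watt(macs, frequency_hz, cycles, power_w):
    """Energy efficiency in TOPS/W."""
    return macs * 2 * frequency_hz / (cycles * power_w) * 1e-12

# 1-bit Conv2D layer at 20 MHz and 0.7 V
print(round(gops(44e6, 20e6, 391e3), 1))                    # ~4.5 GOPS
print(round(tops_per_watt(44e6, 20e6, 391e3, 1.5e-3), 1))   # ~3.0 TOPS/W
```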

FIGURE 21. Efficiency improvement of BS in a typical Conv2D layer at 20 MHz and 0.7 V: (a) without BS and (b)(c) with BS. Each color consistently represents the bitwidth variation.

FIGURE 21(b) and FIGURE 21(c) show the efficiency improvement achieved using BS, measured at 20 MHz and 0.7 V. The horizontal axis indicates the skip ratio; a higher percentage corresponds to a smaller ROI area within the image. Without BS, a 1-bit Conv2D registers a performance of only 3.0-1.8 TOPS/W at voltages between 0.7 V and 0.9 V (see FIGURE 21(a)). With the application of BS, this range is boosted to a span of 27.7-10.2 TOPS/W. Notably, at a skip ratio of 0.9, BS provides a significant efficiency improvement of roughly 9.2× compared to the scenario without BS. The effectiveness of BS is especially pronounced in the initial layers, owing to their large input feature maps.

E. Overall Evaluations and Comparison

TABLE 2 summarizes the overall results at 20 MHz and 0.7 V for five modern tiny network models: MobileNetV2 [9], MobileNetV1 [33], the Visual Wake Words (VWW) [47] challenge 2019 champion model [10], and two MicroNet variants [11]. Note that an 8-bit model was used for MicroNet, in accordance with the original paper. In addition, we evaluated a classification task using edge images to investigate the further possibilities of data available in the edge environment, considering a potential integration with event vision sensors such as the one proposed in [48]. For this purpose, we created and evaluated an edge VWW dataset using the edge extraction technique described in [49].

TABLE 2. Summary of the measurement results using advanced tiny NNs. (b) indicates the weight bitwidth.

These comprehensive results highlight the capability of Pianissimo to provide practical inference speeds (inferences/sec) across all model variations, including the 1-bit to 8-bit models. Inferences/sec is calculated as frequency / cycles, where the number of cycles is obtained from ModelSim simulation. Since external memory accesses during inference are limited to input images and output results, these transfer times are negligible compared to the overall execution time. The results show competitive performance in general image classification tasks, such as the CIFAR-100 and VWW datasets, with results like 66.3 % accuracy on CIFAR-100. We also achieve an accuracy of 83.4 % for VWW with only contour edges. Furthermore, Pianissimo achieves an accuracy of 81.7 % at 24.9 FPS (20 × 10⁶ Hz / 803 × 10³ cycles), consuming just 793 µW with 4-bit MobileNetV1. Additionally, Pianissimo has shown practical results with an AUC score of 92 % and a throughput of 7.35 FPS in anomaly detection using 8-bit MicroNet and the MIMII dataset [45] of toy sounds. This underscores that the utility of Pianissimo is not limited to visual tasks but extends to a broader range of applications.

TABLE 3 lists the measurement results for the two main evaluation models. As mentioned in Section V-C, the observation that power consumption decreases as bitwidth increases is also confirmed at the model level. When operating at 20 MHz and 0.7 V, the models run at around 1 mW. When the conditions are adjusted to 80 MHz and 0.9 V, they operate in the low-power range, staying below 10 mW. In addition, at these settings, peak performances of 5.148 GOPS and 7.961 GOPS were registered at 80 MHz. The table suggests that the models deliver competitive performance, even when accounting for their relatively modest levels of parallelism. In summary, Pianissimo ensures practical inference capability over a wide range of NNs under ultra-low power conditions.

TABLE 3. List of measurement results for the two representative evaluation models.

TABLE 4 compares Pianissimo with recent ultra-low-power inference accelerators [24], [25], [26], [27], [28]. Since the 4-bit weight model offers adequate precision (see FIGURE 19(b)), we use this weight model as our evaluation standard. Pianissimo exclusively supports 8-bit activations to ensure sufficient accuracy. Typically, TOPS/W and GOPS are inversely proportional to the bitwidth of both the weights and the activations.

TABLE 4. Comparison with the recent ultra-low power ML accelerators.

While the accelerators proposed in [24] and [28] operate with ultra-low power consumption, they are confined to specific NN models, limiting their applicability across diverse edge environments. Similarly, the CNN core in [25] can handle mixed 1-bit and 16-bit precision, but the minor impact on peak performance suggests its implementation lacks efficiency. The speech enhancement accelerator presented in [27] supports DSCNNs and optimizes its computational complexity based on the frequency band. However, its range of supported and applicable applications is narrow, and unlike Pianissimo, it lacks a flexible control mechanism, such as a RISC, to optimize power consumption.

TinyVers [26] stands out for its commendable efficiency across several models within the ultra-low power spectrum but presents certain limitations. Particularly, its adaptability leaves room for improvement. Notably, TinyVers does not accommodate the parameter-efficient DSCNNs. Also, a clear gap exists between its performance and theoretical efficiency when downscaling both weights and activations from 8-bit to 2-bit. Instead of achieving the ideal 16× efficiency improvement, it reaches only 4.8×. This disparity arises from TinyVers' approach to mixed precision: it gates parts of the PE. This reveals suboptimal support for mixed precision in its design. Note that TinyVers utilizes a more advanced 22 nm FDX process, incorporates eMRAM technology, and operates at an extremely low voltage of 0.4 V. On the other hand, Pianissimo runs under relatively older technology and higher operational voltages.

SECTION VI.

Conclusion

This paper presents a sub-mW class inference accelerator called Pianissimo, supporting progressively adjustable bit-precision. Leveraging a progressive bit-by-bit datapath, Pianissimo achieves adaptive precision that ranges from 1-bit to 8-bit. Remarkably, scalable precision applications, AP and MP, are obtained using a single weight set without reducing PE utilization. Pianissimo also supports BS processing using sensor information and suppresses unnecessary computation of non-ROIs. SW-HW cooperative control enhances the system’s flexibility, enabling it to accommodate various adaptive inference approaches. Our results show that Pianissimo achieves 0.49–1.25 TOPS/W at 0.7 V on MobileNetV1. Additionally, Pianissimo demonstrates practical performance across various models while operating on sub-mW class power. Thus, Pianissimo introduces a new dimension of flexibility to ultra-low power applications and shows promise in broadening the scope of use cases that can be efficiently supported. Future work will focus on integrating Pianissimo with actual sensor systems and exploring further flexibility enhancements through sparsity utilization within the low-power paradigm.
