A 3.8-μW 10-Keyword Noise-Robust Keyword Spotting Processor Using Symmetric Compressed Ternary-Weight Neural Networks

A ternary-weight neural network (TWN)-based keyword spotting (KWS) processor is proposed to support complicated and variable application scenarios. To achieve high-precision recognition of ten keywords over a wide range of background noises (5 dB SNR to clean), a convolutional neural network consisting of four convolution layers and four fully connected layers is trained with a modified sparsity-controllable, truncated-Gaussian-approximation-based ternary-weight scheme. End-to-end optimization composed of three techniques is utilized: 1) a stage-by-stage bit-width selection algorithm that reduces the hardware overhead of the FFT; 2) a lossy compressed TWN with symmetric kernel training (SKT) and a dedicated internal data-reuse computation flow; and 3) an error-intercompensation approximate addition tree that reduces the computation overhead with marginal accuracy loss. Fabricated in an industrial 22-nm CMOS process, the processor recognizes up to ten keywords in real time under 11 background noise types, with accuracies of 90.6% (clean) and 85.4% (5 dB). It consumes an average power of <inline-formula> <tex-math notation="LaTeX">$3.8 ~\mu \text{W}$ </tex-math></inline-formula> at 250 kHz, and its normalized energy efficiency is <inline-formula> <tex-math notation="LaTeX">$2.79\times $ </tex-math></inline-formula> higher than the state of the art.


I. INTRODUCTION
The keyword spotting (KWS) processor is a widely used speech-triggered interface for human-machine interaction and plays an important role in wearable and mobile devices. Supporting multiple keywords and various background noises allows a wide range of real-world scenarios to be addressed, and has therefore attracted intensive research [1].
The deep neural network (DNN) has demonstrated a prominent advantage in speech recognition [2]. A DNN-based KWS system mainly consists of feature extraction (FE) and a DNN module for word classification [3]. State-of-the-art architectures present design challenges in terms of ultralow power consumption and real-time processing, due to the computationally intensive requirements for achieving high recognition accuracy and noise robustness [4], [5], [6], [7], [8], [9], [10], [11]. In Kim's work [11], the gated recurrent unit (GRU)-based recurrent neural network (RNN) digital classifier consumes 13.6 μW. A DNN with one long short-term memory (LSTM) layer and one fully connected (FC) layer is proposed in [1]; the classifier consumes 5 μW, but it can only be applied in near-microphone scenarios. Quantization is an efficient way to balance recognition accuracy against power consumption [12], [13]. Based on a binarized depthwise separable convolutional neural network (DSCNN), the chip power consumption can be reduced to 0.51 μW; however, that design supports only one to two keywords [14].
In this work, we propose a ternary-weight neural network (TWN)-based noise-robust KWS processor fabricated in an industrial 22-nm technology. To the best of our knowledge, this is the first fabricated KWS processor that realizes high-accuracy, real-time recognition of ten keywords (12 classes) with a power consumption of 3.8 μW. It supports 11 background noise types over a wide signal-to-noise ratio (SNR) range of 5 dB to clean. Our main contributions include the following.
1) A stage-by-stage bit-width selection algorithm (SBSA) for the serial pipeline fast Fourier transform (FFT) to reduce the power consumption of FE. The bit-width of each stage is precisely selected by considering the signal-to-quantization-noise ratio (SQNR).
2) A compressed TWN with lossy, configurable symmetric kernel training (SKT), which supports high recognition accuracy under a wide range of background noises without the cost of a large memory capacity.
3) An error-intercompensation approximate addition tree with an innovative Cartesian genetic programming (CGP)-based error-intercompensation scheme to further reduce the power consumption of the TWN accelerator.
The remainder of this article is organized as follows. Section II introduces the background and challenges of the TWN-based KWS system. The three proposed techniques are described in Sections III-V. Section VI presents the chip implementation and measurement results. Section VII concludes this article.

II. CHALLENGES AND OUR SOLUTIONS
A typical digital DNN-based KWS system is shown in Fig. 1(a). After the speech signal is received by the microphone, it passes through Mel frequency cepstral coefficient (MFCC) extraction for FE, then the neural network computation is performed, and finally the classification result is obtained. In this process, MFCC, data/weight storage, and NN computation are the three stages with the largest energy consumption in hardware. In the TWN-based KWS system, each stage poses different challenges.

A. FFT OPTIMIZATION FOR NOISE-ROBUST LOW-POWER KWS
FFT is one of the key processing units of MFCC-based speech FE and consumes most of its power due to the huge amount of computation and storage: FFT accounts for 62.1% of the power in MFCC, while MFCC accounts for 33.8% of the power in the KWS system [23]. Optimizing the FFT calculation is therefore an efficient way to reduce power consumption. There are three main structures for implementing the FFT: 1) parallel; 2) serial pipeline; and 3) memory-based loop recursion. Among them, the serial pipeline FFT achieves a better balance between hardware resources and throughput, making it more suitable for our lightweight KWS system. The core idea of the serial pipeline FFT is multistage processing: a complete N-point radix-2 FFT is decomposed into log2 N levels of butterfly operations. The butterfly operations of different stages do not interfere with each other, so different stages can be computed with different data bit-widths. Fig. 1(b) shows the distribution of the maximum and mean values of each stage's results for the FFT of the utterance "left." The data magnitudes differ across stages, so precisely determining the data bit-width of each stage to balance accuracy against hardware resources becomes a challenge.

B. TWN HARDWARE IMPLEMENTATION AND OPTIMIZATION WITH APPROXIMATE COMPUTING
Quantization is an efficient approach for DNN hardware implementation. Recent works quantized the activations/weights to binary values, eliminating multiplications and alleviating data storage [14], [23], [24]. However, binarized data leads to severe information loss. Compared with the binary-weight network (BWN) in [23], a TWN has a stronger weight expression ability [25] and can achieve higher recognition accuracy on more complex classification tasks, as shown in Fig. 1(c). The ternary-weight quantization scheme is given in (1)

w^t = a * (+1, w > δ; 0, |w| ≤ δ; −1, w < −δ)   (1)

where δ is the threshold and a is the scaling factor: each weight is represented by the scaling factor a multiplied by +1, 0, or −1, so there is an additional 0 value compared with the BWN. Since 0 does not contribute to the multiplication and addition operations, it increases the model sparsity and enables even less computation. However, since each weight in a TWN is 2 bits, the storage overhead is 2× that of a BWN, so further compressing the model becomes a challenge. Joint optimization of algorithm and hardware in the TWN can realize a KWS system with lower power consumption while achieving higher accuracy and a wider noise adaptation range than a BWN.
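As a minimal illustration of the ternary quantization in (1), the following sketch maps full-precision weights to a · {+1, 0, −1}; the threshold and scaling values here are illustrative, not trained ones.

```python
import numpy as np

def ternarize(w, delta, a):
    """Map each full-precision weight to a * {+1, 0, -1} using threshold
    delta, as in Eq. (1). Weights with |w| <= delta fall into the zero bin."""
    t = np.zeros_like(w)
    t[w > delta] = a
    t[w < -delta] = -a
    return t

w = np.array([0.8, -0.05, -0.6, 0.02])
print(ternarize(w, delta=0.1, a=0.5))  # 0.8 -> +a, -0.6 -> -a, small weights -> 0
```

The zero bin is what distinguishes the TWN from a BWN: those entries can be skipped entirely in hardware.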
Besides, TWNs contain a large number of addition operations, so designing advanced approximate adders is an effective way to reduce computing power consumption; minimizing hardware resources while maintaining accuracy is another challenge. Typical approximate adders use a single design; however, there is potential for mutual compensation between different approximate components. Fig. 1(d) shows two 1-bit approximate half adders as an example: when both inputs are 1, the originally wrong calculation results can be repaired to exact results through mutual compensation. Pairing different approximate adders can thus achieve higher accuracy while consuming less power. However, compensation of the calculation results is a probabilistic event, and complete compensation is difficult to achieve. An effective compensation scheme is therefore required to design the optimal compensation circuit.
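The mutual-compensation idea of Fig. 1(d) can be sketched with two hypothetical 1-bit approximate half adders whose errors at input (1, 1) are +1 and −1, respectively; these truth tables are illustrative, not the exact circuits in the figure.

```python
def ha_exact(a, b):
    """Exact 1-bit half adder value: sum = a XOR b, carry = a AND b."""
    return (a ^ b) + 2 * (a & b)

def ha_over(a, b):
    """Hypothetical approximate half adder: sum = a OR b, carry = a AND b.
    Reads 3 instead of 2 at (1, 1), i.e., a +1 error."""
    return (a | b) + 2 * (a & b)

def ha_under(a, b):
    """Hypothetical approximate half adder: sum = a OR b, carry dropped.
    Reads 1 instead of 2 at (1, 1), i.e., a -1 error."""
    return a | b

# When both adders see (1, 1), the +1 and -1 errors cancel in the combined sum:
print(ha_over(1, 1) + ha_under(1, 1) == ha_exact(1, 1) + ha_exact(1, 1))  # True
```

For the other three input combinations both approximate adders are exact, so the pair only ever errs where the errors cancel when the two results are summed.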

C. OUR END-TO-END SOLUTIONS
The bottom of Fig. 1 shows our end-to-end solutions to these challenges, from FE to neural network computation. First, we propose a 512-point serial FFT with a different bit-width in each stage to reduce the hardware overhead of the MFCC. Second, for the neural network-based classification, we propose a novel symmetric kernel training method for the TWN to reduce weight storage and computation. Symmetric weights enable more efficient model compression at the algorithm level and a unique data-reuse data flow at the architecture level. Moreover, at the circuit level, a heterogeneous approximate adder tree is proposed to reduce the power consumption of the DNN computation with tiny accuracy loss, and a general error-intercompensation scheme is proposed to solve the problem of statistically optimal error compensation between different approximate components.

III. STAGE-BY-STAGE BIT-WIDTH SELECTION ALGORITHM FOR SERIAL PIPELINE FFT
Since the FFT unit requires a large amount of intermediate data buffering for data regularization and temporary storage, reducing the data bit-width can significantly reduce the FFT calculation overhead. We propose SBSA to precisely determine the data bit-width of each stage. In this section, we introduce the details of SBSA with an example. Based on the optimized FFT module, the frame-length-adaptive MFCC is given, and we show how it works in the TWN-based KWS system.
First, the 512-point radix-2 single-path delay-feedback FFT (R2SDF-FFT) proposed in [26] is utilized in this work. In the serial pipeline FFT structure, the earlier stages occupy a larger share of the hardware resources and are more error tolerant, so they offer more optimization space and benefit than the later stages. The SQNR is used to measure the overall error introduced by bit-width truncation; a larger SQNR indicates a more accurate computation result. The SQNR is defined as

SQNR = 10 log10( E[|X(k)|^2] / MSE )   (2)

where X(k) is the exact N-point FFT result computed in double-precision floating point, X̂(k) is the approximate result after bit-width truncation, and MSE is the mean-square error between them. The pseudocode of SBSA is shown in Algorithm 1. According to our previous work [27], the initial bit-width of each stage, Wd_initial, is set to 16; the data format is fixed point with nine fractional bits in the FFT calculation. SQNR_bound represents the tolerable error boundary. When reducing the bit-width of the current stage, the bit-widths of the other stages are unchanged, and the stages are optimized one by one from the first stage within the error boundary. Note that the lowest bit-width of each stage, Wd_limit, is theoretically the bit-width of its input data, i.e., the output bit-width of the previous stage; the Wd_limit of the first stage is the bit-width of the original speech data. For example, for one speech signal from GSCD, we gradually reduce the bit-width of the first stage and compute the SQNR until the error boundary is exceeded. After the bit-width of the first stage is determined, the minimum bit-width of the second stage is set to the bit-width of the first stage, and the second stage is optimized in the same way; the remaining stages follow. In this way, we achieve a tradeoff between hardware optimization and accuracy. As the error boundary decreases, the hardware overhead of the FFT decreases, but the accuracy also deteriorates, so the error boundary must be set reasonably.
After applying SBSA to speech signals of different complexity, some intermediate stages may still show a relatively large bit-width deviation. Thus, the bit-widths obtained for speech signals of different complexities must be considered jointly to determine the optimal selection. One thousand original utterances of different keywords were randomly extracted from GSCD, and the proportion of speech signals selecting each bit-width at each stage is shown in Fig. 2. The error boundary is set to SQNR_bound = 50 dB. The data bit-width increases gradually as the stage becomes deeper. The first two stages have low and uniform bit-width requirements. From the third stage onward, the bit-width distribution shows some spread, because speech signals of different complexity place different requirements on the bit-width of each FFT stage. Of the 1000 speech signals, 25% and 65% can use 14 and 15 bits in stage 9, respectively, while the remaining 10% need 16 bits to ensure accuracy. Therefore, when determining the data bit-width of a specific stage, a bit-width threshold T (T ∈ [0, 1]) is applied: if the proportion of signals requiring the highest bit-width m at the kth stage is not less than T, the bit-width of that stage is set to m; otherwise, it is set to m − 1. As analyzed earlier, the earlier stages offer greater resource-optimization benefit and are less error sensitive; thus, the bit-width threshold T_si is reduced stage by stage. In this article, T_si of the last four stages is set to 0, while T_si of the remaining stages is set to 0.1. Finally, the bit-widths are set to [10, 10, 11, 12, 12, 13, 14, 15, 16].
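The greedy search of Algorithm 1 can be modeled in software as follows. This is a sketch, assuming a radix-2 DIF FFT with per-stage rounding as a stand-in for the hardware truncation; the helper names, and the use of fractional bits in place of the full per-stage data bit-widths, are our assumptions.

```python
import numpy as np

def sqnr_db(x_exact, x_approx):
    """SQNR of Eq. (2): signal power over the MSE of the truncated result."""
    mse = np.mean(np.abs(x_exact - x_approx) ** 2)
    return 10 * np.log10(np.mean(np.abs(x_exact) ** 2) / mse)

def fft_staged(x, frac_bits):
    """Radix-2 DIF FFT with per-stage rounding; frac_bits[s] is the number of
    fractional bits kept after stage s (a software stand-in for the per-stage
    data bit-widths selected in hardware)."""
    n = len(x)
    stages = int(np.log2(n))
    a = np.array(x, dtype=complex)
    for s in range(stages):
        half = n >> (s + 1)
        for blk in range(0, n, 2 * half):
            for k in range(half):
                w = np.exp(-2j * np.pi * k / (2 * half))
                u, v = a[blk + k], a[blk + k + half]
                a[blk + k], a[blk + k + half] = u + v, (u - v) * w
        q = 2.0 ** frac_bits[s]
        a = (np.round(a.real * q) + 1j * np.round(a.imag * q)) / q
    idx = [int(bin(i)[2:].zfill(stages)[::-1], 2) for i in range(n)]  # bit-reverse
    return a[idx]

def sbsa(x, w_init=16, w_limit=10, bound_db=50.0):
    """Greedy stage-by-stage search: shrink each stage's width from the first
    stage onward while the SQNR stays above the error boundary."""
    stages = int(np.log2(len(x)))
    exact = np.fft.fft(x)
    widths = [w_init] * stages
    for s in range(stages):
        while widths[s] > w_limit:
            trial = widths[:]
            trial[s] -= 1
            if sqnr_db(exact, fft_staged(x, trial)) < bound_db:
                break
            widths[s] = trial[s]
    return widths
```

The 512-point hardware search additionally lower-bounds each stage's width by the output width of the previous stage, which this sketch omits for brevity.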
We utilize the frame-length-adaptive MFCC structure proposed in our previous work [27]; the architecture is shown in Fig. 3(a). It consists of an SNR-based voice activity detection (VAD) unit, the optimized FFT-based MFCC, and the TWN accelerator. The input speech signal is sampled at 8 kHz, and the entire architecture operates on 40-ms frames with a 20-ms time step. The VAD unit, which preclassifies the input speech frames [see Fig. 3(b)], performs voice activity detection and redundant-frame skipping. When the processor classifies the input speech, a large number of speech frames are redundant, and avoiding the calculation of unnecessary frames is effective for speech of different word lengths. When the short-time energy of the input signal is less than a predefined threshold, the unit treats the signal as nonvoice and fills the frames with 0 as the output to the TWN accelerator. As shown in Fig. 3(c), the MFCC computation can be bypassed for redundant frames padded with 0 to save energy. Through the skip_0 unit in the control module, the neural network classifier can also adaptively skip the calculation of redundant frames; furthermore, the computation on the large number of 0 weights can be skipped as well. With this structure, the effective frames that contain the necessary information are obtained and the computing power consumption is optimized, as shown in Fig. 3(d).
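The energy-based frame skipping described above can be sketched as follows; the threshold value is illustrative, and the on-chip VAD additionally uses SNR information.

```python
import numpy as np

def vad_frames(x, frame=320, step=160, thresh=1e-3):
    """Energy-based VAD sketch: 40-ms frames (320 samples at 8 kHz) with a
    20-ms step. Frames whose short-time energy falls below the threshold are
    zero-filled so the MFCC and classifier can skip them (the threshold value
    here is illustrative)."""
    out = []
    for i in range(0, len(x) - frame + 1, step):
        f = x[i:i + frame]
        out.append(f if np.mean(f ** 2) >= thresh else np.zeros(frame))
    return out
```

Downstream, a zero-filled frame signals both the MFCC bypass and the skip_0 logic in the classifier.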

IV. SYMMETRIC KERNEL TRAINING-BASED TWN
Symmetric weights have the following two benefits in hardware implementation. First, only one of the two groups of symmetric weights needs to be stored, reducing the weight storage. Second, the data flow for a symmetric kernel can be optimized for more energy-efficient computing. This section introduces the SKT-based TWN implementation from algorithm to hardware, including data reuse under different symmetric modes and the customized data flow.

A. TRAINING PROCESS OF SYMMETRIC TWN MODEL
The proposed TWN model is shown in Fig. 4. Each input speech frame is extracted into 26 Mel coefficients by the MFCC in the frequency domain, and each 1-s utterance is processed into 49 frames in the time domain. The NN model consists of four convolution (Conv) layers and four FC layers, with batch normalization (BN) and rectified linear unit (ReLU) layers between them. The kernel size of all Conv layers is a consistent 3×3; the other hyperparameters are annotated in the figure. Compared with the NN models used by other KWS processors, more FC layers are adopted in our TWN to improve classification accuracy and noise robustness. The Conv2 and Conv3 layers account for the largest share of operations, 75.58% and 19.83%, respectively, and are accompanied by a large amount of memory storage; optimizing these two layers therefore effectively reduces the system hardware overhead. In the traditional TWN model, the threshold δ and scaling factor a are not trainable, so the threshold cannot be adjusted to obtain a highly sparse weight matrix, and the recognition accuracy still lags that of the floating-point network by a large gap. He and Fan [28] pioneered the use of truncated Gaussian approximation (TGA) for weight ternarization. Their algorithm focuses on updating the thresholds in ternary networks rather than the scaling factors; the TGA method jointly trains the quantizer thresholds and weights to minimize the accuracy drop caused by model compression. However, the sparsity in TGA is not controllable. In this work, we introduce an adjustable sparsity parameter S_l that scales the threshold δ of the ternary weights to control the weight sparsity. As shown in Fig. 5(a), the ternary weight update is divided into two steps. First, the threshold is calculated from the weights of the current layer l and the sparsity parameter S_l. Second, the weights are updated according to the newly computed threshold. With this sparsity-controllable quantization scheme, the weight sparsity increases to 33.6%, 6.5× higher than in the full-precision model.
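The two-step update above can be sketched as follows; choosing the threshold as the S_l-quantile of |w| is our assumption and stands in for the exact TGA-based threshold formula.

```python
import numpy as np

def ternary_update(w, s_l, a=1.0):
    """Two-step sparsity-controllable ternarization (sketch of Fig. 5(a)).
    Step 1: derive threshold delta from the layer's weights and the sparsity
    parameter s_l -- here via the s_l-quantile of |w|, an assumption standing
    in for the TGA-based formula, so a fraction s_l of weights is zeroed.
    Step 2: requantize the weights with the new threshold."""
    delta = np.quantile(np.abs(w), s_l)                          # step 1
    t = np.where(w > delta, a, np.where(w < -delta, -a, 0.0))    # step 2
    return t, delta

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000)
t, _ = ternary_update(w, s_l=0.336)
print(float(np.mean(t == 0)))  # close to the 33.6% sparsity reported in the text
```

Raising s_l directly raises the fraction of zero weights, which is exactly the knob the training loop exposes.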
In order to further exploit the sparsity-controllable compression scheme and reduce the computation and storage overhead of the TWN, SKT is proposed in this work; its overall training method is shown in Fig. 5. The ternary weight threshold is calculated from the mean μ_l and variance σ_l of all weights in each Conv layer and the sparsity parameter S_l. The ternary weights are updated according to the newly computed threshold and the ternary function [28]. The scaling factors a of the Conv and FC layers are extracted and folded into the BN layers, so the multiplications can be processed with NAND gates and adders. By adopting the proposed SKT method, the weights of the TWN are significantly compressed, with an accuracy loss of less than 1% under different background noises.

B. SYMMETRIC MODES AND DATA REUSE IN SKT-BASED TWN
Increasing the number of repeated weights has become an effective way to improve the energy efficiency of DNN accelerators [29]. In a TWN, repeated weights are more prevalent: in a 3×3 convolution kernel, 33.3% of the weights repeat when the weight distribution is uniform. To further exploit repetitive weights, we propose SKT, which enables efficient weight compression and data reuse. Three symmetric modes can be obtained by SKT: transverse symmetry, longitudinal symmetry, and diagonal symmetry, shown in Fig. 6. Each symmetric mode exhibits single-weight symmetry or group-weight symmetry. Single-weight symmetry means that only two weights at corresponding positions are equal, while group-weight symmetry means that multiple consecutive weights are the same. Since single-weight symmetry generates little data reuse in hardware, we mainly focus on group-weight symmetry in this work. In the TWN kernels without SKT, 34.59% of single weights and 6.17% of group weights are intrinsically symmetric, and the recognition accuracy is 92.2%. After applying SKT in the Conv2 and Conv3 layers, the symmetry rates of group weights and single weights increase from 6.17% to 66.46% and from 34.59% to 89.48%, respectively, while the recognition accuracy drops by only 0.5%. Applying SKT in the Conv1 layer as well does not improve the symmetry rate significantly, yet incurs a 4.3% accuracy loss. Since Conv2 and Conv3 are also the layers with the largest numbers of parameters and operations, we adopt SKT in these two layers.
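The three symmetric modes of Fig. 6 can be checked on a 3×3 kernel with a few array comparisons. This is a sketch; the mode names follow the text's convention that transverse symmetry mirrors rows.

```python
import numpy as np

def symmetry_modes(k):
    """Classify a 3x3 kernel's symmetry (Fig. 6): transverse mirrors rows
    about the middle row, longitudinal mirrors columns about the middle
    column, and diagonal means the kernel equals its transpose."""
    return {
        "transverse": np.array_equal(k[0, :], k[2, :]),
        "longitudinal": np.array_equal(k[:, 0], k[:, 2]),
        "diagonal": np.array_equal(k, k.T),
    }

k = np.array([[1, 0, -1],
              [0, 1, 0],
              [1, 0, -1]])
print(symmetry_modes(k))  # transverse only: rows 1 and 3 match
```

A check of this form is what the 1-bit "weights info" flag described in Section IV-C would summarize per kernel at training time.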
The data reuse differs across symmetric modes. Since diagonal symmetry has symmetric convolution weights in both the time and frequency domains, it is difficult to generate a large amount of reusable data during convolution, so it is not suitable for hardware deployment. Transverse and longitudinal symmetric kernels generate data reuse more easily; their computational data flows are shown in Fig. 7. For transverse symmetric kernels, when the frames are sequentially input to the processing elements (PEs) through the FIFO, they are convolved with the corresponding weights. The weight reuse occurs in the frequency domain: the weights of the first row can be explicitly reused as the weights of the third row within one convolution operation. Moreover, after the kernel moves twice in the frequency domain, a previously calculated output can be reused. For a longitudinally symmetric kernel, output reuse occurs only when the filter moves in the time domain, i.e., across different input frames, which is harder to achieve. Besides, in our experiments, transverse symmetry achieves the highest accuracy (91.9%), 0.7% higher than longitudinal symmetry and 1.2% higher than diagonal symmetry. Therefore, we adopt transverse symmetry for all symmetric kernels.

C. ARCHITECTURE AND DATA FLOW OF TWN ACCELERATOR
For the efficient hardware implementation of the symmetric convolution kernels, we propose an optimized TWN accelerator; the overall structure is shown in Fig. 8(a). It mainly consists of a computing array with 30 PEs, a main control unit that schedules the loading and storing of weights and data, a weight decode unit that decompresses the encoded weights, a post-process unit that computes the BN and ReLU layers, and on-chip SRAM for data/weight/parameter storage.
The sparsity-controllable and symmetric training of the TWN enables weight compression coding that effectively reduces the weight storage. This work uses Huffman run-length coding (H-RLC), obtained by combining run-length coding (RLC) and Huffman coding, to achieve a better balance between compression rate and decoding complexity. As shown in Fig. 8(b), the value table stores the weight value, and the coding table indicates the number of repeated weights. The coding table is encoded with Huffman coding, which encodes the RLC repetition counts more efficiently. With this approach, the weight storage is reduced by 2.74 kB. Fig. 8(c) shows the H-RLC decoding unit, which sequentially outputs the repeated weights and weight-valid signals through a finite-state machine (FSM) and a counter.
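A software sketch of H-RLC encoding: run-length pairs are formed first, and their repetition counts are then Huffman-coded. The on-chip value/coding tables and the decoding FSM differ in detail; this only illustrates the two coding steps.

```python
from collections import Counter
import heapq

def rlc(weights):
    """Run-length code a ternary weight stream into (value, run) pairs."""
    runs, prev, n = [], weights[0], 1
    for v in weights[1:]:
        if v == prev:
            n += 1
        else:
            runs.append((prev, n))
            prev, n = v, 1
    runs.append((prev, n))
    return runs

def huffman_table(symbols):
    """Build a Huffman code for the run lengths (the role of the coding
    table); frequent repetition counts get shorter codes."""
    heap = [[cnt, [sym, ""]] for sym, cnt in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code or "0" for sym, code in heap[0][1:]}

weights = [0, 0, 0, 1, 1, -1, 0, 0, 0, 0]
runs = rlc(weights)                            # [(0, 3), (1, 2), (-1, 1), (0, 4)]
table = huffman_table([run for _, run in runs])
```

Ternary weights compress well under this scheme because the sparsity-controllable training produces long runs of zeros.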
The PE array adopts a serial pipeline computing method to process the convolutions layer by layer, and each PE is controlled by the main control unit. "Weights info" is a 1-bit flag stored right after each 3×3 convolution kernel, indicating whether the kernel is symmetric. For symmetric kernels, the weight control unit directs the PEs to store the reused weights and outputs, as shown in Fig. 8(d), and the memory control unit loads only one group of the symmetric weights, improving memory access efficiency. The SKT method brings an additional 1.55 kB of weight compression. Compared with a TWN accelerator without SKT, the power consumption of the SKT-optimized TWN accelerator is reduced by 18.6%.
Fig. 9 shows the customized data flow with and without symmetry. First, three weights and data are processed in one PE, so a 3×3 convolution is processed by a set of three PEs in one cycle. Then, in clock cycle t+1, the weights originally input to the three PEs remain unchanged: the data input to PE1 at cycle t is transmitted to PE0, the input of PE2 is transmitted to PE1, and PE2 receives new data to process a new convolution. For a symmetric convolution kernel with a frequency-domain stride of 1, a repeated partial sum appears in PE0 every second movement of the input data. Therefore, PE0 is idle in cycle t+2, and its output partial sum is taken directly from the result stored in the "reuse output" register of PE2 at cycle t.
Data tiling is standard methodology in DNN processors [14]. In this work, to match the designed PE structure and the symmetric data flow, the PEs are divided into groups of three: the 30 PEs form ten groups that process ten output channels simultaneously, with the data broadcast to all ten groups. Input channels are processed sequentially within a group. Using this time-sharing multiplexing method, convolutions with large numbers of input and output channels can be processed more effectively.
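The tiling above can be sketched as nested loops, with a hypothetical function standing in for the PE-array schedule: output channels map to PE groups, and input channels are accumulated serially within a group.

```python
import numpy as np

def tiled_conv(x, w):
    """Tiling sketch for the 30-PE array (the function name is ours): each
    3-PE group owns one output channel, the input feature map is broadcast
    to all groups, and input channels are accumulated serially in a group.
    x: (c_in, H, W) feature map; w: (c_out, c_in, 3, 3) ternary kernels."""
    c_in, h, width = x.shape
    c_out = w.shape[0]          # ten at a time on the 10-group array
    out = np.zeros((c_out, h - 2, width - 2))
    for oc in range(c_out):     # parallel across PE groups in hardware
        for ic in range(c_in):  # serial (time-shared) within a group
            for i in range(h - 2):
                for j in range(width - 2):
                    out[oc, i, j] += np.sum(x[ic, i:i + 3, j:j + 3] * w[oc, ic])
    return out
```

In hardware the outer `oc` loop runs in parallel across the ten groups, so only the input-channel loop costs time.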

V. ERROR INTERCOMPENSATION APPROXIMATE ADDITION TREE WITH CGP-BASED ERROR INTERCOMPENSATION SCHEME
Since the multiplications in convolution are eliminated in the TWN, the addition operations have a significant impact on the computing power dissipation. In this section, we propose an energy-efficient approximate addition tree to further reduce the computing overhead, together with an error-intercompensation scheme that realizes optimal compensation between the different approximate adders in the tree.

A. ERROR INTERCOMPENSATION APPROXIMATE ADDITION TREE
As shown in Fig. 10(a), the 3×3 convolution kernel is organized as a two-stage addition tree. In each stage, a compensation block is composed of two different approximate adders and is processed within one cycle; the two approximate adders are constructed to compensate each other's errors. A 32-bit lower-part-OR adder (LOA) [30] is used to accumulate 10^6 random data. The result is evaluated by the error distance (ED), defined as the absolute value of the difference between the exact and approximate results, shown in Fig. 10(c). The addition tree has a lower ED than the traditional accumulator, while the error-intercompensation addition tree has the lowest. Fig. 10(b) shows the 16-bit approximate adders: full adders (FAs) are used in the high bits, and the low 8 bits can be configured as FAs or approximate adders in 4-bit blocks. "Approximate EN" configures the accurate or approximate mode. The 4-bit approximate adder block 1 is composed of four OR-gate approximate adders with no internal carry and no external carry out. Approximate adder block 2 is the compensation approximate adder, composed of four wires directly connecting input "A" to output "Sum" with a fixed external carry out of "1." It is the iterative result of our proposed compensation scheme and is also called the compensation adder (CA). Compared with an FA, the OR-gate approximate adder and the CA have fewer transistors and shorter delays. We evaluated the TWN accelerator deployed with exact and approximate adders at the circuit level, on an industrial TSMC 22-nm process at 0.54 V and a clock frequency of 250 kHz. The experimental results are shown in Fig. 11: the approximate addition tree brings a 17.3% power reduction in the TWN accelerator, with an accuracy loss of 0.5% (clean) to 1.3% (5 dB).
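A behavioral model of the LOA used in the evaluation, under the common LOA definition (OR on the low part, exact addition on the high part with an AND-generated carry-in); the parameter split shown is illustrative, not the exact 32-bit configuration.

```python
def loa(a, b, k=8, width=32):
    """Behavioral lower-part-OR adder (LOA): the low k bits are approximated
    by bitwise OR; the upper bits are added exactly, with a carry-in formed
    by AND-ing the most significant low-part bits of the two operands."""
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)
    carry = ((a >> (k - 1)) & 1) & ((b >> (k - 1)) & 1)
    high = ((a >> k) + (b >> k) + carry) << k
    return (high | low) & ((1 << width) - 1)

def error_distance(a, b, **kw):
    """ED: absolute difference between the exact and approximate sums."""
    return abs((a + b) - loa(a, b, **kw))

print(loa(100, 27, k=4), error_distance(100, 27, k=4))  # 127 0 (exact here)
print(error_distance(7, 7, k=4))                        # 7: lost low-part carries
```

Accumulating many such sums and averaging `error_distance` reproduces the kind of ED comparison plotted in Fig. 10(c).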

B. CGP-BASED ERROR INTERCOMPENSATION SCHEME
In DNNs, uniformly distributed errors are more likely to compensate each other during calculation and yield higher accuracy. We originate a CGP-based error-intercompensation scheme to compensate the error of a given approximate structure; CGP is a heuristic automatic design method. Through this scheme, the error distribution of the compensated heterogeneous addition tree becomes a uniform distribution of positive and negative errors, achieving joint optimization of calculation accuracy and hardware efficiency. The OR-gate adder has a simple structure and is chosen as the baseline approximate adder. The error distribution of the 4-bit OR-gate adder block is obtained by simulation and then input to the fitness function of CGP as a parameter; the target error characteristics are also computed as a search parameter of CGP. Finally, the compensation adders are grouped into the error-intercompensation addition tree and deployed on the TWN accelerator. The error distribution of the addition tree becomes more uniform after applying the error-intercompensation scheme.
In CGP, each chromosome (i.e., a candidate solution in the iteration) is a 2-D graph structure whose nodes are modeled as Boolean functions; CGP has been proven to generate high-quality approximate circuits [31]. All generated candidate circuits are evaluated by a fitness function, expressed as the mismatch between the generated circuit's accuracy and the target accuracy. The fitness function of our compensation scheme is shown in (3). The sum of error distances (SED) characterizes the overall error of an approximate adder. The target is a 4-bit approximate adder block, so there are 256 input combinations in total. As shown in (4), S represents the result of the candidate approximate adder and Ŝ the exact result, with i and j indexing the input combinations of the two 4-bit addends. SED_ORA represents the error of the OR-gate approximate adder obtained by the previous exhaustive simulation; it keeps the overall error of the generated compensation adder close to that of the given OR-gate adder. SED_CA represents the error of the compensation approximate adder. The SED of the CA is set to twice the SED of the OR-gate adder, due to the intrinsic errors of the two addends. Note that Cost is the cost function introduced to constrain the generated approximate circuit, shown in (5): Cost is 0 only when all errors of the generated circuit are positive. This ensures that CGP evolves toward approximate adders whose error distribution is entirely positive, meeting the compensation requirement. The extremely simple structure of the CA generated by this error-intercompensation scheme is shown in Fig. 10(b).
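The fitness evaluation of (3)-(5) can be sketched as follows, with the OR-gate adder and the wire-based CA as the two 4-bit blocks; the exact form of the fitness function is our reading of the text, not a verbatim reproduction.

```python
def or_adder(a, b):
    """4-bit OR-gate approximate adder block: bitwise OR, no carries
    (errors are always <= 0, since a + b = (a | b) + (a & b))."""
    return a | b

def ca(a, b):
    """Compensation adder: Sum wired to input A with a fixed carry-out '1',
    so the 4-bit block reads A + 16 (errors are always > 0)."""
    return a + 16

def sed(adder, bits=4):
    """Sum of error distances over all 2^bits x 2^bits input pairs (Eq. (4))."""
    return sum(abs(adder(i, j) - (i + j))
               for i in range(1 << bits) for j in range(1 << bits))

def fitness(candidate, sed_target, bits=4):
    """Eq. (3) as we read it: distance of the candidate's SED from the target
    (set to 2x the OR-gate adder's SED), plus an infinite cost (Eq. (5))
    unless every error is positive, so only all-positive-error adders can
    compensate the OR-gate adder's all-negative errors."""
    errs = [candidate(i, j) - (i + j)
            for i in range(1 << bits) for j in range(1 << bits)]
    cost = 0 if all(e >= 0 for e in errs) else float("inf")
    return abs(sum(abs(e) for e in errs) - sed_target) + cost

print(sed(or_adder))                                # 960
print(fitness(ca, sed_target=2 * sed(or_adder)))    # 256
```

The CA's errors (16 − b) are strictly positive, so its cost term is zero and the CGP search reduces to matching the target SED as closely as possible.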

VI. CHIP IMPLEMENTATION AND EVALUATION RESULTS
Fabricated in a TSMC 22-nm ultralow-leakage (ULL) process with HVT transistors, the prototype of the proposed TWN KWS processor is shown in Fig. 12. To evaluate the power consumption and recognition accuracy, the test platform includes a power supply, an oscilloscope, a microphone, a PC, and a testing PCB with the chip on a socket. The PC transmits the required weight parameters to the chip through UART; the external voice signal is collected by a commonly used external digital silicon microphone (SPH0645) and transmitted to the chip through I2S. We tested a total of 30 chips to evaluate their performance. The average power consumption ranges from 3.8 to 150 μW at clock frequencies of 250 kHz to 10 MHz and supply voltages of 0.54 to 0.9 V. The chip testing results are listed in Fig. 12; the prototype system is functional at a supply voltage of 0.54 V and a clock frequency of 250 kHz. The chip achieves a minimum power consumption of 3.8 μW, with 14 kB of on-chip memory and a core area of 0.99 × 0.61 mm².

A. ACCURACY AND POWER MEASUREMENT
The FE module operates on a 20-ms time step at an 8-kHz sample rate and can be dynamically reconfigured to pad redundant frames with 0. We use the Google speech command dataset (GSCD) as the training and validation set, which contains 105K 1-s audio clips covering 35 keywords. Table 1 shows the chip testing accuracy under different noise scenarios: a total of 11 background noise types, all from the NoiseX-92 dataset, at four SNR levels of 5 dB, 10 dB, 15 dB, and clean. Under the various noise types, the chip is highly noise robust, reaching accuracies of over 85% at 5 dB, over 88% at 10 dB, and over 89.1% at 15 dB. Compared with the software baseline, the chip testing results show slight accuracy losses, within 1%, caused by subtle differences between the test environment and the software simulation environment.

B. ANALYSIS OF AREA AND POWER BREAKDOWN
The area and power breakdowns of the processor and the TWN accelerator are illustrated in Figs. 13 and 14, obtained by post-layout simulations at a supply voltage of 0.54 V. The processor contains the TWN accelerator module, the MFCC module, the top control module, and several I/O interfaces. Among them, the TWN accelerator occupies the largest share, accounting for 58% of the total area and 70.42% of the total power consumption. Owing to the SBSA of the FFT, the power consumption of the MFCC module is significantly reduced. At the processor level, leakage and dynamic power account for 42.7% and 57.2% of the total power consumption, respectively. The TWN accelerator is divided into the control unit, data memory, weight memory, PE array, and nonlinear activation unit. The PE array takes up the largest share of area (55.67%), while the data memory consumes the most power (38.84%). Several techniques are used to cut down the power consumption. First, control logic and multiplexers enable the reuse of output partial sums and of weights in symmetric convolution kernels during calculation. Second, the error intercompensation approximate addition tree reduces the addition power consumption. Since the data memory must store the calculation results of each layer while the weight memory is compressed by SKT, the area and power consumption of the weight memory are much lower than those of the data memory. Leakage and dynamic power account for 40.9% and 59.1% of the total power consumption of the TWN accelerator, respectively.
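The exact intercompensation circuit is not reproduced here, but the underlying idea of canceling approximation errors inside an adder tree can be sketched generically: each adder drops the low bits of its operands, and sibling adders alternate whether they compensate upward, so the truncation biases partially cancel as results propagate up the tree. The functions below are a hypothetical software model of that idea, not the hardware design:

```python
def approx_add(a, b, k, round_up):
    """Approximate adder: drop the low k bits of each operand before
    adding; optionally compensate upward by one unit (2**k) so that
    sibling adders with opposite settings cancel each other's bias."""
    s = (a >> k) + (b >> k)
    if round_up and k > 0:
        s += 1
    return s << k

def approx_tree_sum(values, k=2):
    """Reduce a list with a binary tree of approximate adders,
    alternating the compensation direction between sibling adders."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(approx_add(level[i], level[i + 1], k,
                                  round_up=(i // 2) % 2 == 0))
        if len(level) % 2:          # carry an odd element up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

In this toy model, only the leaf level actually loses information (intermediate sums are already multiples of 2**k), so the residual error stays small relative to the exact sum, which mirrors the "marginal accuracy loss" behavior reported for the real circuit.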

C. COMPARISONS WITH STATE-OF-THE-ART WORKS
Table 2 compares this work with state-of-the-art KWS chips. Compared with [1], [5], and [11], our work supports background noise from 5 dB to clean while reaching the lowest power consumption. Compared with [14], our work supports more keywords (12 classes including silence and unknown) and a shorter response time. The high noise robustness of this work benefits from a deeper network architecture with more layers and the finer-grained weight adjustment enabled by ternary weights. The proposed TWN module performs the largest number of operations per frame among the compared works. Because the more complex DNN topology is more error tolerant than those of other works, the proposed TWN can be trained on a dataset that incorporates speech features extracted under each background noise, which further increases noise robustness. By adopting the software-hardware co-optimized SKT-based TWN accelerator with the bit-width-optimized FFT and the approximate addition tree, our KWS chip achieves the highest normalized energy efficiency among ten-keyword designs, 2.79× higher than the previous work [5].
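As a first-order illustration of how average power relates to energy per classification decision (the exact normalization used in Table 2, e.g., across technology nodes and keyword counts, is not reproduced here; the function name and the one-decision-per-interval assumption are ours):

```python
def energy_per_decision_nj(power_uw, decision_rate_hz):
    """Energy per classification decision in nJ, assuming one decision
    per interval of 1/decision_rate_hz: E = P / f (illustrative only)."""
    return power_uw * 1e3 / decision_rate_hz

# At 3.8 uW average power and one decision per 1-s clip,
# the energy cost is 3.8 uJ (3800 nJ) per decision.
e_nj = energy_per_decision_nj(3.8, 1.0)
```

Under such a first-order model, halving either the average power or the work per decision directly halves the per-decision energy, which is why the SKT compression and approximate addition tree translate into the reported efficiency gain.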

VII. CONCLUSION
This article proposed a low-power ternary-weight KWS processor with end-to-end optimized hardware deployment that supports a wide range of background noises from 5 dB to clean. To balance high accuracy against ultralow power, the bit-width of each FFT stage is accurately selected. For keyword classification, we propose SKT to compress the weights and calculations of the TWN. The TWN accelerator architecture and data flow are then optimized to exploit the weight/output reuse of the symmetrical TWN. An error intercompensation approximate addition tree further reduces the computing power of the TWN accelerator. The processor is fabricated in 22-nm CMOS technology and consumes an ultralow power of 3.8 μW. Compared with the state of the art, the proposed design achieves 2.79× higher normalized energy efficiency.

FIGURE 1. End-to-end design challenges and our solutions in the TWN-based KWS system. (a) Neural network-based keyword spotting. (b) Challenge 1: different stages of the FFT have different data sizes. (c) Challenge 2.1: large amount of weight storage in the TWN. (d) Challenge 2.2: compensation potential of different approximate circuits.

FIGURE 3. Frame-length adaptive TWN-based KWS system with an SNR-based VAD unit. (a) Architecture of the frame-length adaptive KWS system. (b) SNR-based VAD unit. (c) Redundant frames padded with zeros and the resulting benefits. (d) Effective frames are retained.
(b) At the beginning of training (the 0th epoch), the weights, thresholds, and scaling factors are initialized from a pretrained neural network. At the t-th epoch, the weights and thresholds are updated based on the structure of the previous epoch. Training proceeds in two steps: the weights are first symmetrized and then ternarized. Symmetrical weights can be computed with either a maximum or an average operation; experiments and comparison show that averaging the weights at symmetrical positions yields higher accuracy. The symmetrized weights are then ternary quantized.
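The symmetrize-then-ternarize step above can be sketched in a few lines of NumPy. This is a minimal model that assumes left-right (transverse) mirroring of a single 2-D kernel and the common TWN threshold heuristic delta = 0.7 · mean|w|; both the mirroring axis and the threshold ratio are our assumptions for illustration, not the paper's exact training code:

```python
import numpy as np

def symmetrize_ternarize(w, delta_ratio=0.7):
    """Sketch of one SKT step on a 2-D kernel w of shape (kh, kw):
    1) average each weight with its left-right mirror image, and
    2) ternarize with threshold delta and per-kernel scale alpha."""
    # Step 1: weight symmetrization by averaging mirrored positions.
    w_sym = 0.5 * (w + w[:, ::-1])
    # Step 2: ternarization -- zero out small weights, scale the rest.
    delta = delta_ratio * np.mean(np.abs(w_sym))
    mask = np.abs(w_sym) > delta
    alpha = np.abs(w_sym[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w_sym) * mask
```

Because the output kernel is exactly mirror-symmetric, only about half of its 2-bit ternary entries need to be stored, which is the source of the weight-memory compression discussed in Section VI-B.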

FIGURE 6. Symmetric modes and symmetric rate in the TWN.

FIGURE 7. Data reuse in (a) the transverse symmetric kernel and (b) the longitudinal symmetric kernel.
(b) Flag indicates whether the weight is encoded; Weight is the 2-bit …

FIGURE 9. (a) Data flow in the PE array without symmetry. (b) Data flow in the PE array with symmetry.

FIGURE 11. Effect of approximation on recognition accuracy and power consumption (software baseline).

FIGURE 12. Test platform and results of the chip.

FIGURE 13. Power breakdown (left) and area breakdown (right) of the processor.

FIGURE 14. Power breakdown (left) and area breakdown (right) of the TWN accelerator.