On-Device Customization of Tiny Deep Learning Models for Keyword Spotting With Few Examples

Designing a customized keyword spotting (KWS) deep neural network (DNN) for tiny sensors is a time-consuming process, demanding training a new model on a remote server with a dataset of collected keywords. This article investigates the effectiveness of a DNN-based KWS classifier that can be initialized on-device simply by recording a few examples of the target commands. At runtime, the classifier computes the distance between the DNN output and the prototypes of the recorded keywords. By experimenting with multiple tiny machine learning models on the Google Speech Command dataset, we report an accuracy of up to 80% using only 10 examples of utterances not seen during training. When deployed on a multicore microcontroller with a power envelope of 25 mW, the most accurate ResNet15 model takes 9.7 ms to process a 1-s speech frame, demonstrating the feasibility of on-device KWS customization for tiny devices without requiring any backpropagation-based transfer learning.

A voice command sensor placed on everyday objects can recognize a set of target keywords to enable speech-controlled functionalities. The keyword classification algorithm, commonly denoted as keyword spotting (KWS) in the literature,1 runs locally on-device to process the audio data recorded by the microphone. These smart sensors are typically battery powered, and microcontroller units (MCUs) are used as data processing engines to meet the stringent energy requirements. MCUs feature a power consumption of up to a few tens of milliwatts but, on the other hand, present limited computation power and on-chip memory capacity, making the porting of robust speech processing pipelines on-device highly challenging.
Recently, deep neural networks (DNNs) for KWS have been efficiently implemented on low-power MCUs.2 These DNN solutions feature up to only a few hundred thousand parameters to fit the memory constraints of tiny MCU devices, typically lower than a few megabytes. Small-sized solutions, which belong to the recently born tiny machine learning (TinyML) domain (which deals with machine learning approaches for tiny and resource-constrained devices), are also applied for wake word detection, e.g., "Hey Snips!",3 where the multiclass keyword classification turns into a simpler binary problem. At present, however, the state-of-the-art TinyML KWS design approaches demand a time-consuming data collection and labeling procedure to train application-specific DNN models using high-performance servers, which represents a bottleneck for rapid prototyping in new environments and target scenarios.
To address this limitation, this article explores whether DNN-based KWS classifiers tailored for TinyML devices can achieve simple and fast on-device customization of the target keywords using only a few (shot) recorded examples. An overview of the adopted approach is depicted in Figure 1. First, a DNN-based feature extractor is trained on a large labeled dataset, which is disjoined from the target environment, using the triplet loss, which has recently been demonstrated to be effective for learning a robust representation for KWS systems.4 The DNN model is fed by temporal-frequency (TF) maps extracted from a speech segment of fixed length and outputs an embedding vector in a low-dimensional space, e.g., 64-256 feature values. The learning process aims at reducing the distance between the embedding vectors belonging to the same class while keeping embeddings from different categories distant in the feature space. This article makes the following contributions: We analyze the effectiveness of DNN feature extractors trained with the triplet loss for the customization of a set of KWS models and discuss the insights of the learned representation.
We describe a design methodology to realize an on-device, customized KWS sensor device powered by a tiny MCU engine. We analyze and compare the accuracy scores of multiple KWS models and their latency, memory, and energy costs on a state-of-the-art MCU chip.
Our approach is tested on the public Google Speech Command (GSC) dataset considering multiple tiny DNN backbones as feature extractors. We run experiments using few-shot enrollments from the target scenario and assess the quality of detecting custom keywords while being robust to false positives. When using only 10 examples, we show that a top accuracy of up to 80% at a false acceptance rate (FAR) of 5% can be achieved with a ResNet15 model, +3% and +11% higher than, respectively, large and small depthwise-separable convolutional neural network (DS-CNN) models. We observe these scores to be lossless with respect to 8-bit quantization. On the best-in-class GAP9 MCU, which features a compute cluster with nine RISC-V cores and a hardware accelerator for convolution operations running in a power envelope of 25 mW,7 the most accurate ResNet15 model takes only 9.7 ms to process 1 s of streaming audio. Overall, our solution is the first to provide on-device customization of the target keyword in the TinyML domain without requiring any backpropagation-based learning of the classifier using data from the target scenario.

DNN FEATURE EXTRACTOR DESIGN
Let f(·) be a DNN feature extractor that operates on speech utterances of 1 s with a sampling frequency of 16 kHz. The audio signal is preprocessed by computing the first 10 mel frequency cepstral coefficients (MFCC) features. We use a frame size of 40 ms and a stride of 20 ms; Hamming windowing is applied. As a result, a TF map of size 49 × 10, denoted as x^S, is fed to the DNN model.
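As a concrete illustration of this front end (an illustrative NumPy sketch, not the authors' code), the parameters below follow the values in the text, while the FFT size and the 40-band mel filterbank are assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_map(signal, sr=16000, n_mfcc=10, frame_ms=40, stride_ms=20,
             n_fft=1024, n_mels=40):
    """Compute the 49x10 MFCC temporal-frequency map for a 1-s utterance."""
    frame, hop = sr * frame_ms // 1000, sr * stride_ms // 1000  # 640, 320 samples
    n_frames = 1 + (len(signal) - frame) // hop                 # 49 for 16,000 samples
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame)                    # Hamming windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft     # power spectrum
    # Triangular mel filterbank (n_mels bands is an assumption of this sketch)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # Keep the first 10 cepstral coefficients -> TF map of shape (49, 10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

With a 16-kHz, 1-s input and no frame centering, 1 + (16000 − 640)/320 = 49 frames are produced, matching the 49 × 10 map size stated above.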
The feature extractor is trained on a source dataset disjoined from the target scenario. We employ the recent Multilingual Spoken Words Corpus (MSWC) dataset,6 which includes up to 39,000 unique utterances in the English train partition. The average number of occurrences per class is 180; the class distribution is highly unbalanced. In our experiments, we only consider speech samples from the 500 categories with the highest number of occurrences, for a total amount of 2.7 million samples (≈5470 utterances per class on average). We adopt the triplet loss L_TR as the optimization function to learn the parameters of the feature extractor, as proposed in a recent work.4 The training process is iterated for multiple epochs structured in episodes. At every training episode, the data loader samples a batch of Q random samples from M randomly chosen classes (from the 500 available) for a total of Q × M utterances. After the MFCC preprocessing, the batch of TF input data is fed to the feature extractor, producing the embeddings z^s_{i,c}, where i indexes the sample and c its class. For every batch, a set of N_tr triplets is drawn by associating to every couple of samples from the same category a random sample from a different category (random negative mining4). Given a triplet (z^s_{i,c}, z^s_{j≠i,c}, z^s_neg), where z^s_neg is a sample of the current batch not belonging to the class c, the triplet loss is computed as

    L_TR = max( d(z^s_{i,c}, z^s_{j,c}) − d(z^s_{i,c}, z^s_neg) + m, 0 )    (1)

where d(·) is the squared Euclidean distance and m is the margin.

Table 1 lists the feature extractor models considered in our study. We report the number of parameters and the millions of multiply-and-accumulate (MAC) operations per inference. We also indicate the size of the largest activation map of each model, which determines the read-write (RAM) memory requirement on the MCU. Model parameters, on the contrary, are fixed and can be stored in read-only memory (e.g., on-chip flash). Among the others, we run experiments with DS-CNN models of different sizes.2 These convolutional architectures replace common 2-D convolution layers, e.g., those used in ResNet15, with a combination of depthwise and pointwise layers to reduce the number of parameters and operations. Conversely, the temporal convolution (TC) ResNet8 makes use of 1-D convolutions after reshaping the input TF data and treating the MFCC features as input channels.
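The per-triplet computation in (1) with random negative mining can be sketched as follows. This is an illustrative NumPy version, not the deployed training code; `z` is assumed to hold the batch of embedding vectors and `y` their class labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def triplet_loss_batch(z, y, margin=0.5):
    """Mean triplet loss over all same-class pairs of the batch.

    For every ordered pair (i, j) with y[i] == y[j], a random negative from a
    different class is drawn (random negative mining), and the hinge of (1)
    is accumulated using squared Euclidean distances.
    """
    d = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    losses = []
    for i in range(len(z)):
        for j in range(len(z)):
            if i != j and y[i] == y[j]:
                neg = rng.choice(np.flatnonzero(y != y[i]))  # random negative
                losses.append(max(d[i, j] - d[i, neg] + margin, 0.0))
    return float(np.mean(losses))
```

When same-class embeddings coincide and negatives lie farther than the margin, the loss vanishes, which is exactly the class-separation behavior that (1) drives.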
Every model computes the embedding vectors from the output of the convolution-based backbone. We refer to this operation as embedding compute in the "Experimental Results" section. Note that ResNet15 and TCResNet8 feature a BatchNorm layer after the last convolution layer, as proposed in the original work using the triplet loss.4 On the DS-CNN backbones, we instead place a layer normalization layer, i.e., LayerNorm, which was observed to lead to a good representation.12 After the normalization, the embedding vectors are computed using a spatial average pooling layer and applying L2 normalization. The size of the final embedding vectors is reported in Table 1.

ON-DEVICE FEW-SHOT KWS
After training, the feature extractor is frozen for deployment on the MCU platform. We apply 8-bit post-training quantization to calibrate the quantization ranges. A prototype-based classifier is initialized on top of the feature extractor with few examples of the target speech commands. The speech samples can be collected on-device by recording multiple audio segments of 1 s including the command. More formally, the KWS classifier setup takes K examples for each of the N custom classes, {x^T_{i,j}}, i = 1..K, j = 1..N, i.e., the speech keywords. For simplicity, this study considers the same number of examples for every target class, even if the number of samples per class can differ.
After the audio recording, the TF map of every audio keyword is fed to the feature extractor to compute the embedding vector z^T_{i,j}. For every class j, a prototype vector c_j is calculated as the mean of the class embeddings:

    c_j = (1/K) Σ_{i=1..K} z^T_{i,j}    (2)

In addition to the target classes, we add a prototype for the silence class and one for the unknown class. These vectors are computed by feeding, respectively, K samples of background noise and K random words to the feature extractor. We assume that none of the unknown examples include a target keyword. These extra prototypes can be computed before deployment and directly stored on-device.
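The enrollment step in (2) reduces to a per-class mean over the K embeddings; a minimal sketch (illustrative, not the on-device implementation):

```python
import numpy as np

def compute_prototypes(embeddings, labels):
    """Class prototypes as in (2): the mean embedding over the K enrollment
    recordings of each class (target keywords plus silence and unknown)."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}
```

The silence and unknown prototypes are obtained by the same call, simply by tagging the K noise segments and K random words with two extra label values.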
At inference time, the TF map x of the current speech segment is classified by predicting the class label y_pred as

    y_pred = argmin_j d(f(x), c_j)  if  min_j d(f(x), c_j) ≤ γ;  otherwise  y_pred = y_unk    (3)

where y_unk is the index of the unknown class and γ is the score threshold. This γ parameter can be tuned to adjust the false acceptance rate (FAR) of the classifier, i.e., the ratio of unknown samples classified as a target keyword. A lower value of γ forces a more conservative behavior by predicting more samples as unknowns. Note that the optimal tuning of γ to achieve a target value of FAR can only be derived a posteriori. In this work, we simply propose setting the value of γ to the margin m during the test. We motivate this choice by noting that the loss function (1) drives the training process to guarantee a class separation distance larger than m, on average. In our study, we also investigate the impact of applying a softmax normalization to the distance vector d(f(x), c_j) in (3), which is commonly used in other few-shot learning approaches, e.g., ProtoNet.9
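The nearest-prototype rule of (3) can be sketched as follows (an illustrative version; `z` stands for the embedding f(x) of the current segment, and the threshold name `gamma` is ours):

```python
import numpy as np

def classify(z, prototypes, gamma=0.5, unknown_id="unknown"):
    """Rule (3): accept the closest prototype only if its squared Euclidean
    distance is within the score threshold gamma; otherwise reject as unknown."""
    ids = list(prototypes)
    dists = np.array([((z - prototypes[c]) ** 2).sum() for c in ids])
    best = int(np.argmin(dists))
    return ids[best] if dists[best] <= gamma else unknown_id
```

Lowering `gamma` rejects more samples as unknown, which is exactly the conservative behavior described above.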

EXPERIMENTAL RESULTS
We train the feature extraction models of Table 1 on the MSWC dataset for 40 epochs of 400 episodes using Adam as the optimizer with a learning rate of 0.001, which is divided by 10 after 20 epochs. Every episodic batch loads 80 samples from each of 20 classes, for a total of 1600 data points per batch, from which triplets are sampled. The margin m is 0.5. To account for different initialization seeds, every training process is repeated three times, and every trained model is individually tested. The final results are averaged over the multiple runs.
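The episodic batching above (80 samples from each of 20 randomly chosen classes) could be sketched as follows; sampling with replacement within a class is an assumption of this sketch, needed because MSWC classes are unbalanced:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(labels, n_classes=20, per_class=80):
    """Draw the indices of one episodic batch: `per_class` utterances for each
    of `n_classes` randomly chosen categories (1600 indices in total)."""
    chosen = rng.choice(np.unique(labels), size=n_classes, replace=False)
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=True)
        for c in chosen
    ])
```

The returned index array feeds the MFCC preprocessing and the triplet sampler of (1).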
For the testing phase, we use GSC, a collection of speech utterances of 1 s belonging to 35 categories. We split the dataset into a positive partition GSC+ and a negative partition GSC−. GSC+ is composed of 10 target keywords: on, off, left, right, up, down, go, stop, yes, and no. We use five categories (backward, forward, visual, follow, and learn) to compute the prototype of the unknown class. The GSC− partition includes the samples from the remaining 20 classes. We keep the original train/test split of GSC, and we use the train split to fetch the (few) samples used to compute the prototypes of (2). The silence prototype is obtained by feeding the DNN model with background noise waveforms also belonging to the GSC dataset.
To assess the quality of a trained feature extractor model on the target GSC dataset, we measure the classification accuracy (ACC) as the ratio of correctly classified utterances of the GSC+ test set. On the other hand, we use GSC− to assess the FAR as the percentage of negative samples misclassified as positives. An ideal classifier scores ACC = 100% and FAR = 0%. However, in a real system, the ACC score tends to decrease when the FAR decreases, e.g., by reducing the threshold γ in (3).
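The two metrics can be computed as follows (a minimal sketch under the conventions above; the `unknown_id` sentinel is ours):

```python
def acc_and_far(pos_preds, pos_labels, neg_preds, unknown_id="unknown"):
    """ACC: fraction of GSC+ utterances assigned their correct keyword.
    FAR: fraction of GSC- utterances accepted as some keyword instead of
    being rejected as unknown."""
    acc = sum(p == t for p, t in zip(pos_preds, pos_labels)) / len(pos_preds)
    far = sum(p != unknown_id for p in neg_preds) / len(neg_preds)
    return acc, far
```

Sweeping the threshold γ of (3) and recomputing (ACC, FAR) at each value yields the operating curves discussed next.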
Table 2 reports the ACC scores for the considered models at FARs of 5% and 1%, obtained by opportunely tuning the threshold a posteriori. We run experiments using one, three, five, and 10 shots, optionally applying the softmax on the distance vector of (3). Because the results are impacted by the selection of the few-shot samples used to compute the prototypes, we repeat every test 10 times and average the obtained statistics. Using the squared Euclidean distance for classification without softmax systematically leads to the highest accuracy, which increases with the model capacity and the number of examples used to compute the prototype. Among the tested models, ResNet15 achieves the highest scores of 80% and 70% at FARs of, respectively, 5% and 1%. In particular, the accuracy level at FAR = 5% is 3% and 11% higher than the ACC of DS-CNN-L and DS-CNN-S. A maximum accuracy of up to 74% is already reached with a low number of examples, i.e., three. To get more insights on the tradeoff between accuracy and FAR, Figure 2 visualizes the receiver operating characteristic curve of one instance of each model initialized with 10 examples.
Figure 3 plots ACC versus FAR scores with a fixed threshold γ = m = 0.5, as proposed for the deployment.

The Pareto front (upper left corner) is populated by instances of the ResNet15 model. With a low fixed threshold, the FAR achieves a value as low as 0.1%-0.2%, and the accuracy increases up to 54% using 10 examples. Under the same settings, DS-CNN-L reaches an accuracy of 58% at an FAR of 0.6%, meaning the classifier will trigger three times more false-positive alarms per hour than ResNet15. In the plot, we observe a trend where models with a lower capacity present a higher accuracy and FAR than models with a higher capacity. We explain this by noting that the big models relax the compactness of the target class clusters in exchange for a superior interclass distance.

MCU DEPLOYMENT
We assess the deployment costs of our solution on the GAP9 MCU, which includes a compute cluster with nine RISC-V cores, supporting vectorized half-precision floating-point (FP16) and 8-bit integer MAC instructions, and a hardware (HW) convolution accelerator. To quantize and deploy the DNN models, we adopt the GAPflow toolset, which is provided by the chip manufacturer. Every model is quantized to 8 bits with asymmetric per-channel quantization ranges, as supported by the HW accelerator. We rely on post-training quantization by feeding four random training samples to the frozen DNN model to estimate the quantization ranges. Table 3 reports the accuracy before and after quantization for a 10-shot test with four classes (yes, stop, go, and no) in addition to the unknown and silence prototypes. We reduced the number of keywords with respect to the previous setting to match a realistic use-case scenario with an acceptable accuracy (>70%). Overall, post-training quantization works in a lossless manner on our few-shot problem. On ResNet15, batch normalization layers are quantized individually because the current version of the tool cannot fold them into the subsequent convolution layers (the other models feature a reverse order of these layers). In this configuration, 8-bit quantization initially caused an accuracy drop with respect to the original model, which is fully recovered by casting the BatchNorm layer quantization to FP16.
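As an illustrative sketch of the asymmetric per-channel 8-bit scheme (not the GAPflow implementation), quantizing a weight matrix with one (min, max) range per output channel amounts to:

```python
import numpy as np

def quantize_asym_per_channel(w):
    """8-bit asymmetric quantization with one range per output channel
    (row) of the weight matrix, as supported by the HW accelerator."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 255.0           # step size per channel
    zero_point = np.round(-lo / scale)                  # maps lo to code 0
    q = np.clip(np.round(w / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```

Because each channel gets its own range, the round-trip error stays within about one quantization step per channel, which is consistent with the lossless behavior observed in Table 3.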
Table 3 also reports the latency measured on-chip when the clock frequency is set at 240 MHz (at a supply voltage of 0.65 V). In particular, we notice that the DNN inference of ResNet15 is 2.25× faster than DS-CNN-L despite an 8.4× higher number of MAC operations. This is due to the variable execution efficiency of different layer types on the GAP9 HW convolution accelerator. While a 3 × 3 2-D convolution layer inside ResNet15 can reach a peak efficiency of up to 100 MAC operations per clock cycle (MAC/cyc), the pointwise layers of the DS-CNN are limited to 40 MAC/cyc. In the latter model, the computation of the embedding vector with a LayerNorm operator also impacts the total time, constituting 44% of the workload on the cluster cores (Emb Compute in the table). This workload includes the computation of the mean and standard deviation of the feature map values before the average pooling, which may be reduced by further optimization. On the contrary, ResNet15 avoids such a cost by using BatchNorm before the pooling layer. Considering a power consumption of 25.5 mW,7 the solution using the largest model is 8.7× more energy hungry than running TC-ResNet8 but 10% more accurate and with a 0.66% lower FAR on the considered benchmark.
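For reference, the energy per inference follows directly from power × latency; with the 25.5-mW figure above, the 9.7-ms ResNet15 inference costs roughly 0.25 mJ per 1-s speech frame (a back-of-the-envelope estimate, not a measured value):

```python
def energy_mj(power_mw, latency_ms):
    """Energy per inference in millijoules: mW x ms = uJ, divided by 1000 -> mJ."""
    return power_mw * latency_ms / 1000.0

resnet15_energy = energy_mj(25.5, 9.7)  # ResNet15 on GAP9: about 0.247 mJ
```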

COMPARISON WITH OTHER WORKS
Zhang et al.2 showed an MCU implementation of lightweight DNN models for KWS, including the DS-CNN backbones. When trained end to end on the GSC dataset, these models report an accuracy of >94% on the 12 categories. Vygon and Mikhaylovskiy4 proposed training a ResNet15 using the triplet loss on the same GSC dataset, achieving a top performance of >98% accuracy when using a k-nearest neighbors (KNN) classifier. A similar approach is adopted by Huh et al.,5 but proposing a centroid-based classifier or a support vector machine (SVM) trained with all of the positive samples of the train set. In this work, we also adopt the triplet loss to train our tiny DNN feature extractors, but on the MSWC dataset, which is disjoined from the test set. Additionally, our prototype-based classifier is initialized with only a few examples and features a lower memory cost than the KNN and the SVM, which have to store all of the reference examples or the support vectors. In the context of few-shot learning for KWS, many works relied on the prototypical network (ProtoNet) approach.9 Similarly to our proposed approach, ProtoNet uses episodic training and prototype-based classification, but cross entropy is adopted as the optimization criterion. Unfortunately, several previous studies fetched train and test data from the same domain (typically GSC) or did not analyze the FAR of the classifier on negative data, reporting only the accuracy over a test set of target keywords. Additionally, no work in this class focused on TinyML models and their implementation.
Another set of works, denoted as "query by example," leverages the embeddings extracted using a recurrent neural network (RNN) to individuate a wake word. As an example, the work by Chen et al.8 trained a two-layer long short-term memory and proposed using a similarity score to compare the last embedding state with a reference vector of the target keyword. RNN methods are, however, expensive at inference time because of the memory-bound workload, which is up to 28 times less efficient than running convolutional models on our target HW.10 A design of this class has instead been prototyped on a Hi3516EV200 development board with an ARM Cortex-A7 processor running at 900 MHz and 32 MB of memory under a power envelope of 100 mW, which is much higher than the capacity of MCU devices.11 Recently, Mazumder et al.6 used transfer learning to train a DNN-based classifier with few shots. This approach places a linear classifier on top of a backbone model trained on the multilingual MSWC dataset. The authors employed an EfficientNet model with 11 million parameters as the backbone and trained the classifier to distinguish a single target keyword from the unknown and background (silence) classes.
Table 4 compares the accuracy scores at FAR = 5% with three, five, and 10 shots and 10 classes with the results of other methods. In Table 4, we also report the ACC and FAR when γ = 0. For comparison purposes, we applied a transfer learning approach to the DS-CNN-L model in a few-shot regime. In this case, we train the DS-CNN-L backbone on the MSWC dataset featuring a linear classifier for classifying the 500 keyword classes. At test time, a new classifier is trained on the frozen backbone for eight epochs. Our experimental results show that our proposed approach is more accurate than a transfer-learning-based solution and does not require any training operation on the target dataset, in opposition to preceding methods.2,4,5 In particular, we notice that the classifier trained using transfer learning is more prone to false positives in an open-set setting: the accuracy drops when tuning the threshold γ to achieve the target FAR of 5%.

CONCLUSION
This work showed the effectiveness of TinyML models for KWS classification with few shots without any backpropagation-based learning. As a top result, we showed a ResNet15 model able to score 80% accuracy at an FAR of 5% using only 10 samples per class not seen during training. This model runs efficiently on an MCU device in fewer than 10 ms under a power envelope of 25 mW. We believe this result motivates an increasing research effort toward the development of fast customization techniques for TinyML devices. Future work will also address adaptation strategies to bridge the accuracy gap with respect to models trained end to end on the target dataset.
After quantizing and deploying the KWS feature extractor on an MCU-based sensor system, a classifier of voice commands is initialized simply by recording one or more examples for every user-defined speech command. At inference time, KWS classification is based on the distance between the output feature vector of the current sample and the class prototypes.

FIGURE 1. Keyword spotting (KWS) customization using few-shot utterances. After training a DNN feature extractor, the class prototypes are computed on-device and used for classification.

Figure 4 gives insights into the representation learned by one of the trained ResNet15 models. The test embeddings obtained from the GSC+ utterances are plotted in a 2-D space using t-SNE (upper part of the figure), together with a subset of unknown samples (the black points not clustered). Some classes, e.g., stop, yes, and no, feature a high accuracy because of their compact representation and the nonoverlap with adjacent classes. Conversely, other keywords, such as up, off, and on, suffer from a scattered representation, leading to a per-class accuracy of <50% when γ = m. As can be seen from the confusion matrix in the lower part of the figure, the model predicts a large part of the positive samples as unknowns, while critical errors, i.e., the wrong target class, are a minimal part.

FIGURE 4. (Top) The t-distributed stochastic neighbor embedding representation of the test sample embeddings obtained from the ResNet15 model. The ACC score is reported close to every class cluster. (Bottom) Confusion matrix obtained when γ = m.

TABLE 1 .
Deep neural network feature extractor for tiny machine learning keyword spotting.

TABLE 2 .
Accuracy for 10-keyword classification with a varying number of examples at false acceptance rates of 5% and 1%. The bold values refer to the best scores. FAR: false acceptance rate.

TABLE 3 .
Accuracy of the quantized models and deployment costs on the GAP9 MCU at 240 MHz and powered at 0.65 V.

TABLE 4 .
Comparison with other methods.
Bold refers to the best scores. KNN: k-nearest neighbors; SVM: support vector machine.

IEEE Micro, November/December 2023.