Automatic Audio Feature Extraction for Keyword Spotting

Paola Vitolo, Graduate Student Member, IEEE, Rosalba Liguori, Member, IEEE, Luigi Di Benedetto, Alfredo Rubino, and Gian Domenico Licciardo, Senior Member, IEEE


Abstract-The accuracy and computational complexity of keyword spotting (KWS) systems are heavily influenced by the choice of audio features in speech signals. This letter introduces a novel approach for audio feature extraction in KWS by leveraging a convolutional autoencoder, which has not been explored in the existing literature. Strengths of the proposed approach are the ability to automate the extraction of the audio features, to keep the computational complexity low, and to achieve accuracy values of the overall KWS system comparable with the state of the art. To evaluate the effectiveness of our proposal, we compared it with the widely used Mel Frequency Cepstrum (MFC) method in terms of classification metrics in noisy conditions and the number of required operators, using the public Google Speech Command Dataset. Results demonstrate that the proposed audio feature extractor achieves an average classification accuracy on 12 classes ranging from 81.84% to 90.36% when the signal-to-noise ratio spans from 0 to 40 dB, outperforming the MFC by up to 5.2%. Furthermore, the required number of operations is one order of magnitude lower than that of the MFC, resulting in a reduction in computational complexity and processing time, which makes it well suited for integration with KWS systems in resource-constrained edge devices.

Index Terms-Autoencoder, edge computing, keyword spotting, neural network, speech feature extraction.

I. INTRODUCTION
KEYWORD Spotting (KWS) has become a crucial topic in recent years as it allows voice recognition systems to be more responsive, saves energy, and improves privacy and data security in edge computing contexts [1], [2], [3]. The conventional KWS pipeline consists of three main building blocks: a preprocessing stage to adapt the microphone output to audio processing systems, a feature extraction block, and a Neural Network (NN)-based classifier. The extraction of feature information is crucial for audio classification, as it directly affects identification accuracy as well as the overall computational complexity of the system. It represents a computationally complex block, as it heavily relies on mel-scale-related techniques, such as Mel-Frequency Cepstral Coefficients (MFCCs), the mel spectrogram, and Power Normalized Cepstral Coefficients (PNCCs), which require Fourier Transforms (FT) in various processing stages [4], [5]. Several researchers have recently investigated approaches to lighten speech features, such as quantizing speech feature maps [6] or reducing MFCC feature matrices [7], [8]. However, these methods still rely on computationally complex FT processes.

The aim of this work is to demonstrate the advantages of using Auto-Encoders (AE) [9], [10], [11] to implement an automated, data-driven approach for audio feature extraction. The main advantages of the proposed approach are as follows:
• Automated feature extraction, eliminating the need for an expert to choose the best features and set the extractor parameters appropriately;
• Reduction in the number of required operators, leading to lower power consumption and processing time;
• Potential integration of the proposed extractor with the other NN-based blocks of KWS systems (e.g., with the PDM-to-PCM converter [12] and the NN-based classifier [13]), thereby creating a compact NN-based end-to-end KWS system.

The advantages of the proposed AE-based feature extractor stem from comparisons with conventional MFCCs, which are the most used features in state-of-the-art KWS systems [14], [15]. Several aspects have been considered: the number of required operations, the classification metrics when the features are used for KWS classification, and the behavior under noise. The KWS classifier and the dataset used to evaluate the feature performance are the model described in the MLCommons/tiny benchmarking system [13], [16] and the public Google Speech Command Dataset (GSCD) [17]. In the absence of noise, the KWS classifier that uses the proposed extracted features achieves an accuracy of 90.36%, which is comparable with its counterpart. However, the proposed approach exhibits a 5.2% higher accuracy when the Signal-to-Noise Ratio (SNR) equals 0 dB, making it more robust in noisy environments. Additionally, the number of operators of the proposed approach is one order of magnitude lower than that of the MFCC method, resulting in a reduction in computational complexity and required processing time.

II. THE PROPOSED AUDIO FEATURE EXTRACTION
Fig. 1 illustrates the proposed setup used to realize the automatic audio feature extractor, where an encoder and a decoder are combined to form an AE. The primary function of the encoder is to extract a compressed, low-dimensional representation of the input data, while the decoder aims to reconstruct the input using the features extracted by the encoder. The AE is trained with the objective of minimizing the loss function between the input and the output. In this way, while the AE learns to reconstruct the input, the encoder learns to automatically derive the most informative features from the input signal. Once training is completed, the decoder can be discarded and the encoder alone is used as a feature extractor, as shown in Fig. 2.
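For reference, the minimal TensorFlow/Keras sketch below illustrates this train-then-discard workflow: an encoder and a decoder are wired into a single autoencoder, trained to reconstruct their own input, and the encoder alone is then kept as the feature extractor. The placeholder layer stacks, the input shape, and the use of a plain MAE loss here are illustrative assumptions; the actual architecture and training losses are given in Sections II-A and II-C.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(input_shape, encoder, decoder):
    """Combine an encoder and a decoder into a single trainable autoencoder."""
    inputs = tf.keras.Input(shape=input_shape)
    features = encoder(inputs)   # compressed, low-dimensional representation
    outputs = decoder(features)  # reconstruction of the input
    return models.Model(inputs, outputs, name="autoencoder")

# Placeholder sub-models (the real layer stacks are described in Section II-A).
encoder = models.Sequential([layers.Conv2D(2, 8, padding="same", activation="tanh"),
                             layers.MaxPooling2D(4)], name="encoder")
decoder = models.Sequential([layers.UpSampling2D(4),
                             layers.Conv2DTranspose(1, 8, padding="same", activation="tanh")],
                            name="decoder")

autoencoder = build_autoencoder((320, 256, 1), encoder, decoder)  # assumed input shape
autoencoder.compile(optimizer="adam", loss="mae")                 # reconstruction loss
# autoencoder.fit(x_train, x_train, ...)   # the input is also the training target

# Once training is completed, the decoder is discarded:
feature_extractor = encoder
```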

A. Neural Network Architecture
The architecture of the proposed AE is depicted in Fig. 3.
The AE consists of two main components: the encoder, which serves as a feature extractor and dimensionality reducer, and the decoder, responsible for reconstructing the input from the extracted features. During training, the network tries to minimize input-output errors. This iterative process enables the network to autonomously learn robust features, effectively mitigating the impact of noise while retaining crucial information. The proposed model is based on a Convolutional Neural Network (CNN), chosen because of its efficiency and effectiveness in finding local spatial coherences, and its reduced number of parameters and required operations compared to other models [18], [19], [20]. While CNNs are widely known for their success in image processing [19], they have also shown promising results in speech-related applications, such as speech recognition, speaker identification, and emotion recognition [21]. Although significant progress has been made in automated neural architecture search in recent years [22], identifying the optimal architecture remains a challenging task that continues to rely on specific case studies and expert knowledge. In this work, the proposed AE model has been designed through an iterative trial-and-error process, involving adjustments to the number of layers, the number of neurons per layer, and the hyperparameter values. The proposed encoder consists of four convolutional layers (Conv2D) that act as filters, each followed by a max-pooling layer (MaxPool) to reduce the input dimensionality. Conv2D_1, Conv2D_2, Conv2D_3, and Conv2D_4 have kernel sizes of (32 × 1), (8 × 8), (16 × 16), and (8 × 8), respectively, while their numbers of channels are 32, 16, 2, and 2. The strides are equal to 1, the padding is set to "same", and the activation function is the hyperbolic tangent (1). The pool size of the first three pooling layers is (4 × 4), while that of the last one is (5 × 4).
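A Keras sketch of this encoder, reproducing the kernel sizes, channel counts, "same" padding, tanh activations, and pool sizes listed above, is reported below. The input shape is an assumption chosen only so that the pooling stages divide evenly; the actual input arrangement, parameter counts, and output shapes are those of Table I.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(320, 256, 1)):  # input shape assumed for illustration
    """Four Conv2D filters, each followed by a MaxPool stage (Section II-A)."""
    return models.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, (32, 1), strides=1, padding="same", activation="tanh"),  # Conv2D_1
        layers.MaxPooling2D((4, 4)),
        layers.Conv2D(16, (8, 8), strides=1, padding="same", activation="tanh"),   # Conv2D_2
        layers.MaxPooling2D((4, 4)),
        layers.Conv2D(2, (16, 16), strides=1, padding="same", activation="tanh"),  # Conv2D_3
        layers.MaxPooling2D((4, 4)),
        layers.Conv2D(2, (8, 8), strides=1, padding="same", activation="tanh"),    # Conv2D_4
        layers.MaxPooling2D((5, 4)),
    ], name="encoder")

build_encoder().summary()  # inspect layer parameters and output shapes (cf. Table I)
```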
The decoder is composed of one upsampling layer of size (5 × 1) and two transposed convolution layers with 2 channels, (1) as activation function, and a kernel size of (16 × 16). Table I reports the architecture of the proposed AE, detailing the layers, the number of parameters, and the output shapes.

To assess the effectiveness of the proposed feature extractor (Fig. 2, right) in comparison with the traditional MFCC extraction (Fig. 2, left), a keyword classifier has been trained using the features obtained from both extractors. The chosen classifier follows the architecture proposed in the MLCommons/tiny benchmarking system [13], [16] and, as sketched below, consists of the following layers:
• One 2D convolutional layer with a kernel size of 10 × 4 and 64 channels;
• A stack of depthwise-separable convolutional (DSConv2D) layers;
• One 2D max-pooling layer;
• One dense layer with 12 neurons.
Each DSConv2D layer is composed of a depthwise 2D convolutional layer with a kernel size of 3 × 3, followed by a 2D convolutional layer with a kernel size of 1 × 1 and 64 channels. The activation function used for the convolutional layers is the ReLU function (2), while the softmax function (3) is applied to the last layer to obtain probabilities for class membership.
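The sketch below follows this description in Keras. The number of DSConv2D blocks, the input feature-map shape, the pool size, and the final flattening step are assumptions made only to obtain a runnable example; the reference configuration is the one defined in [13], [16].

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(input_shape=(49, 10, 1), num_blocks=4, num_classes=12):
    """DS-CNN-style keyword classifier; block count and input shape are assumptions."""
    model = models.Sequential(name="kws_classifier")
    model.add(tf.keras.Input(shape=input_shape))
    # Front-end 2D convolution: 10x4 kernel, 64 channels, ReLU activation (2).
    model.add(layers.Conv2D(64, (10, 4), padding="same", activation="relu"))
    # DSConv2D blocks: 3x3 depthwise convolution + 1x1 pointwise convolution, 64 channels.
    for _ in range(num_blocks):
        model.add(layers.DepthwiseConv2D((3, 3), padding="same", activation="relu"))
        model.add(layers.Conv2D(64, (1, 1), padding="same", activation="relu"))
    # 2D max pooling, then a 12-neuron dense layer with softmax (3) for class probabilities.
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

build_classifier().summary()
```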

B. Dataset
The GSCD referenced in [17] has been used to train and evaluate the proposed feature extraction system. This dataset consists of 105829 utterances of 35 words. Each utterance is a one-second (or shorter) WAVE file, encoded as 16-bit single-channel PCM values with a sampling rate of 16 kHz [17]. Twelve classes have been selected from the dataset, following the approach used in the MLCommons/tiny benchmarking system [13], [16]. The chosen classes are: "Yes," "No," "Up," "Down," "Left," "Right," "On," "Off," "Stop," "Go," "Background," and "Unknown." The "Background" class comprises one-second clips randomly extracted from background-noise audio files, while the "Unknown" class consists of words randomly sampled from the remaining classes. To ensure uniformity in the dataset, recordings shorter than 1 s have been zero-padded. Additionally, to create a balanced dataset, the same number of words has been selected for every class. As the class "Up" contains the fewest words, i.e., 3723 words [17], this number has been used for each class, resulting in a total dataset size of 44676 words. Additive white Gaussian noise has been employed to evaluate the robustness of the proposed approach to noise, as it represents the lowest performance bound among all noise types [23]. The noise has been added to each utterance in the dataset, with an SNR ranging from 0 to 40 dB.
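As a concrete illustration of this corruption step, the helper below zero-pads a clip to one second at 16 kHz and adds white Gaussian noise at a target SNR. The function name, the NumPy implementation, and the power-based noise scaling are illustrative choices and are not taken from the original pipeline.

```python
import numpy as np

def pad_and_add_awgn(signal, snr_db, length=16000, rng=np.random.default_rng(0)):
    """Zero-pad `signal` to `length` samples and add white Gaussian noise at `snr_db`."""
    x = np.zeros(length, dtype=np.float32)
    x[:len(signal)] = signal                                # zero-padding of short clips
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))  # SNR = 10*log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=length).astype(np.float32)
    return x + noise

# Example: corrupt a 0.8 s utterance (12800 samples at 16 kHz) at 0 dB SNR.
clean = np.sin(2 * np.pi * 440 * np.arange(12800) / 16000).astype(np.float32)
noisy = pad_and_add_awgn(clean, snr_db=0)
```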

C. Training
The proposed autoencoder and the classifier proposed in [13], [16] and described in Section II-A have been modeled and trained in Python using the TensorFlow (TF) framework [24]. For the autoencoder, two loss functions have been employed: the Mean Absolute Error (MAE) and the Fast-Fourier-Transform Mean Absolute Error (FFT-MAE). The FFT-MAE is a custom loss function introduced in [25], [26]. As described by (4), it returns the mean absolute error between the FFT of the model outputs and the FFT of the corresponding labels. The model has first been trained with MAE as the loss function and has subsequently been fine-tuned using FFT-MAE.
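A possible TensorFlow implementation of such an FFT-domain loss is sketched below. Since (4) and the exact formulation of [25], [26] are not reproduced here, the use of the real-valued FFT and of the magnitude of the complex difference is an assumption.

```python
import tensorflow as tf

def fft_mae(y_true, y_pred):
    """Mean absolute error between the FFTs of the targets and of the model outputs."""
    fft_true = tf.signal.rfft(tf.cast(y_true, tf.float32))
    fft_pred = tf.signal.rfft(tf.cast(y_pred, tf.float32))
    return tf.reduce_mean(tf.abs(fft_true - fft_pred))  # |.| of the complex difference

# Two-stage training: plain MAE first, then fine-tuning with the FFT-domain loss.
# autoencoder.compile(optimizer="adam", loss="mae");    autoencoder.fit(...)
# autoencoder.compile(optimizer="adam", loss=fft_mae);  autoencoder.fit(...)
```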
The dataset has been divided into training (80%), validation (10%), and test (10%) sets. The batch size and the number of epochs have been set to 64 and 100, respectively. Early stopping has been used to reduce overfitting, setting the patience to 0.6 and monitoring the validation loss. The performance of the proposed autoencoder has been evaluated in terms of MAE, Mean Square Error (MSE), and FFT-MAE. For the classifier, the dataset has been divided into training (80%) and test (20%) sets. A 4-fold cross-validation has been used to reduce overfitting and to estimate the generalization performance of the model on different splits of the training dataset. The model has been trained with a batch size of 256 for 200 epochs. Sparse categorical cross-entropy has been chosen as the loss function, Adam as the optimizer, and accuracy as the metric. The performance of the model has been evaluated in terms of accuracy, precision, recall, and F-score.
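The classifier training loop can be reproduced along the lines of the sketch below; the use of scikit-learn's KFold utility and the function and variable names are illustrative assumptions, while the fold count, batch size, epochs, loss, optimizer, and metric follow the values stated above.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_classifier_cv(build_fn, x_train, y_train, n_splits=4):
    """4-fold cross-validation of a Keras classifier built by `build_fn`."""
    accuracies = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(x_train):
        model = build_fn()
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train[train_idx], y_train[train_idx],
                  validation_data=(x_train[val_idx], y_train[val_idx]),
                  batch_size=256, epochs=200, verbose=0)
        accuracies.append(model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)[1])
    return np.mean(accuracies), np.std(accuracies)
```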

III. RESULTS
The results of the autoencoder training show an MAE of 0.0338, an MSE of 0.00828, and an FFT-MAE of 2.8 on the test set. Fig. 4 reports the classification metrics on the validation set of the classifier trained with the proposed extracted features for each fold. The accuracy of the classifier using the proposed features, averaged over the 4 folds, is 89.69% with a standard deviation of 0.57, while the precision, recall, and F-score are 89.82% (0.59), 89.69% (0.51), and 89.67% (0.58), respectively. These small standard deviations confirm the robustness of the proposed model to the different data splits, ensuring the reliability of the results. As can be seen from the confusion matrix in Fig. 5, the classifier using the proposed features achieves an average accuracy, precision, recall, and F-score on the test set of 90.36%, 90.51%, 90.36%, and 90.35%, respectively. The proposed approach has been compared with the following state-of-the-art features: mel spectrogram, MFCC, and PNCC [5], [27], [28]. As can be seen in Table II, the classifier using the proposed features achieves an accuracy approximately 2.07% higher than that based on PNCC, while 0.68% and 0.33% lower than those obtained with the MFCC and the mel spectrogram, respectively. As can be seen from the confusion matrix on the test set in Fig. 6, the MFCC classifier obtains the best performance, with an accuracy averaged over the 12 classes of 91.04% and a standard deviation of 0.045. However, as shown in Fig. 7(a), the average accuracy of the MFCC classifier drops faster than that of the proposed solution as the SNR decreases. Specifically, at an SNR of 0 dB, the classifier using MFCC achieves an accuracy of 76.64%, which is 5.2% lower than the accuracy obtained by our classifier.

Table III provides a comparison between the number of operations per frame required by the proposed feature extractor and by the best performer of Table II, the MFCC, calculated as shown in [29]. The proposed solution requires 2800 additions and 2688 multiplications per kernel size, which are about one order of magnitude lower than the counterpart (14192 additions and 12529 multiplications), and does not require the calculation of the Log function. The overall advantage of the proposed approach is shown in Fig. 7(b), where the accuracy weighted by the number of required operations per frame of the proposed approach outperforms the MFCC for each SNR value. Therefore, the proposed solution can replace the resource-hungry FT-based feature extractor for more lightweight audio processing. These results pave the way to the design of a complete pipeline of neural networks, forming an end-to-end KWS system by merging the proposed feature extractor with the NN-based PDM-to-PCM converter in [12] and a CNN-based classifier.
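To make the operation counts concrete, the short calculation below totals the additions and multiplications quoted for the two extractors and compares them, together with the 0 dB accuracies, under the assumption that the weighting of Fig. 7(b) simply divides accuracy by the number of operations per frame (the exact weighting used in the figure is not reproduced here).

```python
# Operation counts quoted in the text: additions + multiplications.
proposed_ops = 2800 + 2688      # 5488 operations
mfcc_ops = 14192 + 12529        # 26721 operations
print(mfcc_ops / proposed_ops)  # ratio of operation counts

# Assumed ops-weighted accuracy at 0 dB SNR, using the accuracies reported above.
print(81.84 / proposed_ops)     # proposed extractor
print(76.64 / mfcc_ops)         # MFCC
```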

IV. CONCLUSION
This letter introduces a novel data-driven method using a compact AE for automated audio feature extraction. The evaluation in a KWS system, against MFCCs and using the GSCD, has shown its advantages in accuracy, noise behavior, and computational efficiency, making it well suited for resource-constrained devices. Future works will aim to integrate the proposed extractor into an end-to-end NN-based KWS system and to design edge-computing-specific hardware.

Fig. 1. Block diagram of the proposed automatic audio feature extractor. Network training minimizes input-output errors, allowing autonomous learning of robust features and reducing the impact of noise while preserving crucial information.

Fig. 2. Block diagrams of the traditional mel-scale-based (left) and the proposed NN-based (right) feature extraction systems.

Fig. 3. Architecture of the proposed convolutional autoencoder. It consists of an encoder (left) and a decoder (right). The encoder acts as a feature extractor and dimensionality reducer, while the decoder reconstructs the input from the extracted features.

Fig. 5. Confusion matrix on the test set of the classifier based on the proposed extractor.

Fig. 6. Confusion matrix on the test set of the classifier using the MFCCs.

Fig. 7. Accuracy trends of the MFCC-based classifier and of the proposed extractor-based classifier as the SNR varies.

TABLE II: ACCURACY ON THE TEST SET OF THE DIFFERENT METHODS

TABLE III: OPERATIONS PER FRAME REQUIRED BY THE PROPOSED FEATURE EXTRACTOR AND BY THE MFCC [29]