Abstract:
Automatic accent classification is an active research field in speech processing. It can help identify a speaker's region of origin, which is useful in police investigations carried out by Law Enforcement Agencies and for improving current speech recognition systems. This article presents a novel descriptor called Grad-Transfer, extracted using the Gradient-weighted Class Activation Mapping (Grad-CAM) method, which is based on convolutional neural network (CNN) interpretability. Additionally, we propose a methodology for accent classification that implements Grad-Transfer and is based on transferring the knowledge acquired by a CNN to a classical machine learning algorithm. The article works on two hypotheses: first, that the coarse localization maps produced by Grad-CAM on spectrograms highlight the regions of the spectrograms that are important for predicting accents; and second, that Grad-Transfer descriptors computed from audio recordings are distinctive descriptions of the target accents. These hypotheses were demonstrated experimentally by clustering the generated Grad-Transfer descriptors according to the original accent of the recordings using the Birch and k-means algorithms. We carried out experiments on the Voice Cloning Toolkit dataset and observed increases of up to 23.00% in macro average accuracy and up to 23.58% in unweighted average recall for a Gaussian Naive Bayes classifier, compared to a model trained with spectrograms. This demonstrates that Grad-Transfer can improve the performance of accent classification models and opens the door to new implementations in similar tasks.
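One of the metrics reported above, unweighted average recall (UAR), can be illustrated with a small example. The following is a minimal sketch, assuming the common definition of UAR as the mean of the per-class recalls, with every class counted equally regardless of how many samples it has. The function name, the accent labels, and the predictions are hypothetical and are not taken from the article or its experiments.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with each class weighted equally.

    Assumes the usual definition of UAR (macro-averaged recall).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not data from the article).
y_true = ["english"] * 4 + ["scottish"] * 2
y_pred = ["english", "english", "english", "scottish", "scottish", "english"]

uar = unweighted_average_recall(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, plain accuracy = {plain_accuracy:.3f}")
```

In this toy case the per-class recalls are 0.75 and 0.50, so UAR is 0.625, while plain accuracy is about 0.667. The gap shows why a class-balanced metric is informative when accents are unevenly represented.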
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 31)