Speech Emotion Recognition Based on Large-Scale Automatic Speech Recognizer | IEEE Conference Publication | IEEE Xplore


Abstract:

This paper proposes a novel speech emotion recognition (SER) method that fully leverages the architecture of Whisper, a large-scale automatic speech recognition (ASR) model. Conventional SER models built on a pre-trained speech encoder may fail to capture linguistic content because their decoders are too simple. The proposed method addresses this shortcoming by adopting the Whisper decoder, which conventional SER discards, to leverage its language modeling capability. It introduces special tokens corresponding to the target emotions and then fine-tunes the entire Whisper model. We also propose a training scheme suited to Whisper, named serialized multi-task learning (SerialMTL), which treats various kinds of speech information as context for the target SER task. In SerialMTL, the model first predicts subtask tokens, such as transcription and gender tokens, and then estimates the emotion token. An advantage of the proposed method is that the model structure stays simple even when new subtasks are added. Experimental results show that our model, based on the entire Whisper model, achieves better SER performance than the conventional model and improves further with SerialMTL training using ASR and gender recognition subtasks.
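
The serialized target described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the token strings, the emotion and gender label sets, the ordering of the transcription before the gender token, and the whitespace split that stands in for Whisper's real tokenizer. The abstract states only that subtask tokens come first and the emotion token comes last.

```python
from typing import List, Optional

# Hypothetical special tokens. The abstract says emotion tokens are added
# to Whisper's vocabulary but does not give their names or the label set.
EMOTION_TOKENS = {
    "angry": "<|angry|>",
    "happy": "<|happy|>",
    "neutral": "<|neutral|>",
    "sad": "<|sad|>",
}
GENDER_TOKENS = {"female": "<|female|>", "male": "<|male|>"}


def build_serial_target(transcript: str, gender: str, emotion: str) -> List[str]:
    """Build one serialized target: subtask tokens first, emotion token last.

    The transcript-then-gender order is an assumption; the abstract only
    fixes that the emotion token follows the subtask tokens.
    """
    words = transcript.split()  # stand-in for Whisper's subword tokenizer
    return words + [GENDER_TOKENS[gender], EMOTION_TOKENS[emotion]]


def extract_emotion(decoded: List[str]) -> Optional[str]:
    """Return the label of the last emotion token in a decoded sequence."""
    token_to_label = {tok: label for label, tok in EMOTION_TOKENS.items()}
    for tok in reversed(decoded):
        if tok in token_to_label:
            return token_to_label[tok]
    return None  # no emotion token was generated


target = build_serial_target("i can not believe it", "female", "happy")
predicted = extract_emotion(target)
```

Because each subtask contributes only tokens to the output sequence, adding a subtask in this sketch means extending the target sequence rather than attaching a new output head, which mirrors the structural simplicity the abstract claims.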
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
