Abstract:
This paper proposes a novel speech emotion recognition (SER) method that fully leverages the architecture of Whisper, a large-scale automatic speech recognition (ASR) model. Conventional SER models built on a pre-trained speech encoder may fail to capture linguistic content because their decoders are too simple. The proposed method addresses this shortcoming by adopting the Whisper decoder, which conventional SER discards, to leverage its language modeling capability. It introduces special tokens corresponding to the target emotions and then fine-tunes the entire Whisper model. We also propose a training scheme suited to Whisper, named serialized multi-task learning (SerialMTL), which uses various kinds of speech information as context for the target SER task. In SerialMTL, the model first predicts subtask tokens, such as transcription and gender tokens, and then estimates the emotion token. An advantage of the proposed method is that the model structure remains simple even when new subtasks are added. Experimental results show that our model, based on the entire Whisper model, achieves better SER performance than the conventional model, and that performance improves further with SerialMTL training using ASR and gender recognition subtasks.
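The serialized ordering described for SerialMTL can be illustrated with a minimal sketch of how a decoder target sequence might be assembled. This is not the authors' implementation: the token strings, the emotion and gender label sets, the whitespace tokenization, and the helper name `build_serial_target` are all hypothetical, and the relative order of the subtasks (transcription before gender) is an assumption, since the abstract only states that subtask tokens precede the emotion token.

```python
from typing import Dict, List

# Hypothetical special tokens; the abstract does not give the real vocabulary.
EMOTION_TOKENS: Dict[str, str] = {
    "neutral": "<|neutral|>",
    "happy": "<|happy|>",
    "sad": "<|sad|>",
    "angry": "<|angry|>",
}
GENDER_TOKENS: Dict[str, str] = {
    "female": "<|female|>",
    "male": "<|male|>",
}


def build_serial_target(transcript: str, gender: str, emotion: str) -> List[str]:
    """Assemble a serialized target: subtask tokens first, emotion token last.

    Whitespace splitting stands in for a real subword tokenizer (assumption).
    """
    subtask_tokens = transcript.split() + [GENDER_TOKENS[gender]]
    # The emotion token comes last so that, under autoregressive decoding,
    # it is predicted with the subtask outputs as context.
    return subtask_tokens + [EMOTION_TOKENS[emotion]]


target = build_serial_target("i am fine", "female", "happy")
print(target)
```

Under this layout, adding a new subtask only means appending another token to the subtask portion of the sequence; the model structure itself stays the same, which is the simplicity the abstract points to.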
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025