Abstract:
This paper extends Whisper’s automatic speech recognition (ASR) capabilities to perform speech-based emotion recognition (SER) by incorporating word-level emotion classification alongside ASR output. We generate four emotion pseudo-labels (neutral, happy, sad, angry) for each word using a pretrained frame-level SER model, and Whisper is fine-tuned for joint ASR and emotion classification at the word level. Sentence-level emotion labels are masked during training to encourage the transformer to use the ASR output for word-level emotion prediction. During inference, word-level predictions are combined with sentence-level predictions through majority voting to generate the final sentence-level label. When evaluated on the IEMOCAP dataset, our method maintains Whisper’s ASR word error rate while improving the SER weighted accuracy from 74.4% to 76.4% and the unweighted average recall from 77.1% to 79.0%.
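The abstract states that the final sentence-level label is obtained by majority voting over the word-level predictions combined with the sentence-level prediction. A minimal sketch of that fusion step is shown below; it is an illustration only, not the authors' code, and the equal vote weighting and the tie-breaking rule are assumptions.

    from collections import Counter

    EMOTIONS = ["neutral", "happy", "sad", "angry"]  # the four pseudo-label classes named in the abstract

    def fuse_predictions(word_level, sentence_level):
        """Majority vote over per-word emotion predictions plus the
        sentence-level prediction (assumed to count as one extra vote)."""
        votes = Counter(word_level)
        votes[sentence_level] += 1
        best, best_count = votes.most_common(1)[0]
        # Assumed tie-breaking rule: fall back to the sentence-level label.
        tied = [label for label, count in votes.items() if count == best_count]
        return sentence_level if len(tied) > 1 else best

    # Example: per-word predictions from the fine-tuned Whisper emotion head
    print(fuse_predictions(["happy", "happy", "neutral"], "happy"))  # -> "happy"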
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025