Extending Whisper for Emotion Prediction Using Word-level Pseudo Labels | IEEE Conference Publication | IEEE Xplore


Abstract:

This paper extends Whisper’s automatic speech recognition (ASR) capabilities to perform speech-based emotion recognition (SER) by incorporating word-level emotion classification alongside ASR output. We generate an emotion pseudo-label from one of four classes (neutral, happy, sad, angry) for each word using a pretrained frame-level SER model, and fine-tune Whisper for joint ASR and emotion classification at the word level. Sentence-level emotion labels are masked during training to encourage the transformer to use the ASR output for word-level emotion prediction. During inference, word-level predictions are combined with the sentence-level prediction through majority voting to generate the final sentence-level label. When evaluated on the IEMOCAP dataset, our method maintains Whisper’s ASR word error rate while improving the SER weighted accuracy from 74.4% to 76.4% and the unweighted average recall from 77.1% to 79.0%.
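
The inference-time majority vote described in the abstract could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `vote_sentence_emotion`, the one-vote-per-prediction weighting, and the tie-breaking rule (prefer the sentence-level prediction, then the order of `EMOTIONS`) are all assumptions, since the abstract does not specify them.

```python
from collections import Counter

# The four emotion classes named in the abstract.
EMOTIONS = ("neutral", "happy", "sad", "angry")


def vote_sentence_emotion(word_labels, sentence_label):
    """Combine word-level and sentence-level predictions by majority vote.

    Assumptions (not stated in the abstract): every prediction counts as
    one vote, and ties go to the sentence-level prediction when it is among
    the tied labels, otherwise to the first tied label in EMOTIONS order.
    """
    for label in list(word_labels) + [sentence_label]:
        if label not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {label!r}")

    # One vote per predicted word, plus one vote for the sentence prediction.
    votes = Counter(word_labels)
    votes[sentence_label] += 1

    top = max(votes.values())
    tied = [label for label in EMOTIONS if votes[label] == top]
    if sentence_label in tied:
        return sentence_label
    return tied[0]


if __name__ == "__main__":
    # Three "sad" words outvote the "neutral" word and sentence prediction.
    print(vote_sentence_emotion(["sad", "sad", "sad", "neutral"], "neutral"))
```

With this scheme, a long utterance is dominated by its word-level predictions, while a very short one falls back largely on the sentence-level prediction.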
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
