SkelETT—Skeleton-to-Emotion Transfer Transformer
Abstract:

Emotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has heavily relied on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not comprehensively capture the full spectrum of human emotion, and they suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative to facial expressions, providing valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition by pre-training a transformer model on a large corpus of unlabeled human skeleton representations and subsequently fine-tuning it for emotion classification. By exposing the model to this extensive unlabeled skeleton data, it effectively learns to represent the complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification tasks. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. SkelETT patches 2D body pose representations and processes them through a series of encoder layers, each combining multi-head self-attention mechanisms with position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data, using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset. Compared to the state-of-the-art, we report significant gains in accuracy ($\approx$ 34% higher), training time ($\approx$ 50% less), and model complexity reduction ($\approx ...
Figure: Our proposal comprises two learning phases: on the left, unsupervised pre-training, where a transformer model captures complex spatial and temporal patterns in human skel...
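The architecture described in the abstract can be illustrated with a minimal sketch: an encoder-only transformer that flattens short windows of 2D joint coordinates into patches, embeds them, and runs stacked self-attention layers before a classification head. All dimensions below (number of joints, patch length, model width, 26 emotion classes) are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class SkelETTSketch(nn.Module):
    """Hypothetical encoder-only transformer for 2D pose sequences.

    Dimensions are placeholders chosen for illustration; the paper's
    real configuration is not reproduced here.
    """
    def __init__(self, n_joints=17, patch_len=4, seq_len=64,
                 d_model=128, n_heads=4, n_layers=4, n_classes=26):
        super().__init__()
        self.patch_len = patch_len
        n_patches = seq_len // patch_len
        # Each patch flattens `patch_len` frames of 2D joint coordinates.
        self.proj = nn.Linear(patch_len * n_joints * 2, d_model)
        # Learned positional embedding, one vector per patch.
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                      # x: (B, seq_len, n_joints, 2)
        b, t, j, c = x.shape
        # Group consecutive frames into patches and flatten each patch.
        x = x.reshape(b, t // self.patch_len, self.patch_len * j * c)
        x = self.proj(x) + self.pos            # patch embedding + positions
        x = self.encoder(x)                    # multi-head self-attention + FFN
        return self.head(x.mean(dim=1))        # mean-pool tokens, classify

model = SkelETTSketch()
logits = model(torch.randn(2, 64, 17, 2))      # batch of 2 pose sequences
print(logits.shape)                            # torch.Size([2, 26])
```

In the two-phase setup the abstract describes, the same encoder would first be pre-trained on unlabeled skeleton sequences (e.g. from MPOSE) and only the head, or selected layers, fine-tuned on the labeled emotion data.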
Published in: IEEE Access ( Volume: 13)
Page(s): 23344 - 23358
Date of Publication: 24 January 2025
Electronic ISSN: 2169-3536