Abstract:
Speech emotion recognition (SER) technology has recently become a trend in a broader field and has achieved remarkable recognition performance using deep learning techniques. However, the recognition performance obtained using end-to-end learning directly from the raw audio waveform still hardly exceeds that based on hand-crafted acoustic descriptors. Instead of relying solely on the raw waveform or acoustic descriptors for SER, we propose an acoustic space augmentation network, termed the Dual-Complementary Acoustic Embedding Network (DCaEN), that combines knowledge-based features with a raw waveform embedding learned under a novel complementary constraint. DCaEN includes representations from eGeMAPS acoustic features and the raw waveform by specifying a negative cosine distance loss that explicitly constrains the raw waveform embedding to differ from the eGeMAPS space. Our experimental results demonstrate improved emotion-discriminative power on the IEMOCAP database, achieving 59.31% accuracy in four-class emotion recognition. Our analysis also shows that the learned raw waveform embedding of DCaEN converges close to a reverse mirroring of the original eGeMAPS space.
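The complementary constraint described above can be sketched as minimizing the cosine similarity between the two embeddings, which pushes the raw waveform embedding away from the eGeMAPS representation. The following is a minimal illustration in plain NumPy; the function name and the exact formulation (minimizing raw cosine similarity rather than, e.g., its shifted variant) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def complementary_cosine_loss(wave_emb, egemaps_emb, eps=1e-8):
    """Hypothetical sketch of a negative-cosine-distance constraint:
    minimizing this value drives the raw-waveform embedding away from
    the eGeMAPS embedding (toward its reverse mirror)."""
    num = float(np.dot(wave_emb, egemaps_emb))
    den = float(np.linalg.norm(wave_emb) * np.linalg.norm(egemaps_emb)) + eps
    return num / den  # cosine similarity in [-1, 1]

# Toy check: identical vectors give ~1 (high loss, undesired alignment);
# opposite vectors give ~-1 (loss minimized, "reverse mirroring").
a = np.array([1.0, 0.0])
print(round(complementary_cosine_loss(a, a), 4))   # 1.0
print(round(complementary_cosine_loss(a, -a), 4))  # -1.0
```

Minimizing this term alongside the emotion classification loss would, under these assumptions, encourage the two embedding spaces to carry complementary rather than redundant information, consistent with the reverse-mirroring behavior reported in the analysis.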
Published in: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
Date of Conference: 03-06 September 2019
Date Added to IEEE Xplore: 09 December 2019