Most research on recognizing emotion-related states from the human speech signal concentrates on acoustic analysis. However, results from the last decade show that the task is not yet solved satisfactorily, especially for real-life speech data and in particular for the assessment of speakers' valence. This paper therefore investigates novel approaches to the additional exploitation of linguistic information. To ensure good applicability to the real world, spontaneous speech and non-acted, non-prototypical emotions are examined in the recently popular dimensional model, a continuous three-dimensional space. As linguistic analysis approaches and experiments for this model are lacking, various methods are proposed. The best results are obtained with the described bag-of-n-grams and bag-of-character-n-grams approaches, which are introduced for the first time for this task and allow for an advanced vector space representation of the spoken content. Furthermore, string kernels are considered. With early fusion and combined space optimization of the proposed linguistic features with acoustic ones, the regression of continuous emotion primitives outperforms reported benchmark results on the VAM corpus of highly emotional face-to-face communication.
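As a rough illustration of the two ideas named above, the following minimal sketch builds a bag-of-character-n-grams vector from a transcript and fuses it with an acoustic feature vector by simple concatenation (early fusion). It is not the paper's implementation: the function names, the toy transcripts, the plain count weighting, and the two acoustic values are all assumptions made for illustration, and the combined space optimization and the regressor are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(transcripts, n):
    """Map every character n-gram seen in the transcripts to a column index."""
    grams = sorted({g for t in transcripts for g in char_ngrams(t, n)})
    return {gram: idx for idx, gram in enumerate(grams)}


def bag_of_char_ngrams(text, vocab, n):
    """Count vector over the vocabulary (raw term frequencies, an assumption)."""
    counts = Counter(char_ngrams(text, n))
    return [float(counts.get(gram, 0)) for gram in vocab]


def early_fusion(acoustic, linguistic):
    """Early fusion: concatenate both feature vectors before regression."""
    return list(acoustic) + list(linguistic)


# Hypothetical toy data: two transcripts and made-up acoustic features.
transcripts = ["oh no", "no no"]
vocab = build_vocabulary(transcripts, n=2)
linguistic = bag_of_char_ngrams("no no", vocab, n=2)
acoustic = [0.12, -0.40]
fused = early_fusion(acoustic, linguistic)
```

The fused vector would then be passed to a single regressor that predicts each emotion primitive, so acoustic and linguistic evidence are weighed jointly rather than combined after separate predictions.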