In this paper, we introduce a new model, the Latent Mixture of Discriminative Experts (LMDE), which automatically learns the temporal relationships between different modalities. Because we train a separate expert for each modality, LMDE can improve prediction performance even with a limited amount of data. For model interpretation, we present a sparse feature ranking algorithm that exploits L1 regularization. We provide an empirical evaluation on the task of listener backchannel prediction (i.e., head nods). We also introduce a new evaluation metric, User-adaptive Prediction Accuracy, that takes into account differences in people's backchannel responses. Our results confirm the importance of combining five types of multimodal features: lexical, syntactic structure, part-of-speech, visual, and prosodic. The Latent Mixture of Discriminative Experts model outperforms previous approaches.
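As a rough illustration of the idea behind L1-based sparse feature ranking (a minimal sketch, not the paper's exact algorithm: the model, data, and hyperparameters below are assumptions for demonstration), one can fit an L1-regularized logistic regression by proximal gradient descent and rank features by the magnitude of their learned weights; the L1 penalty drives uninformative weights to exactly zero, so the surviving weights induce a ranking.

```python
import numpy as np

def l1_logistic_rank(X, y, lam=0.1, lr=0.1, iters=2000):
    """Fit L1-regularized logistic regression via proximal gradient descent,
    then rank features by |weight| (illustrative sketch only)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (p - y) / n           # gradient of the logistic loss
        w = w - lr * grad                  # plain gradient step
        # soft-thresholding: the proximal operator of the L1 penalty,
        # which zeroes out weights smaller than lr * lam
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    order = np.argsort(-np.abs(w))         # features ranked by importance
    return order, w

# Toy data (hypothetical): feature 0 drives the label, features 1-2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(float)
order, w = l1_logistic_rank(X, y)
```

On this toy data the predictive feature receives the largest absolute weight and the noise features are shrunk toward zero, so `order` places feature 0 first.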