The usage patterns of speech and visual input modes are investigated as a function of relative input mode efficiency for both desktop and personal digital assistant (PDA) working environments. For this purpose, the form-filling part of a multimodal dialogue system is implemented and evaluated with three multimodal modes of interaction: “Click-to-Talk,” “Open-Mike,” and “Modality-Selection.” “Modality-Selection” implements an adaptive interface in which the system selects the most efficient input mode at each turn, effectively alternating between a “Click-to-Talk” and an “Open-Mike” interaction style, as proposed in “Modality tracking in the multimodal Bell Labs Communicator,” in Proceedings of the Automatic Speech Recognition and Understanding Workshop, by A. Potamianos et al., 2003. The multimodal systems are evaluated and compared with the unimodal systems. The objective and subjective measures used include task completion, task duration, turn duration, and overall user satisfaction. Turn duration is broken down into interaction time and inactivity time to better measure the efficiency of each input mode. Duration statistics and empirical probability density functions are computed as a function of interaction context and user. Results show that the multimodal systems outperform the unimodal systems on both objective and subjective criteria. Users also tend to use the most efficient input mode at each turn; however, biases towards the default input modality and a general bias towards the speech modality also exist. Results demonstrate that although users exploit some of the available synergies in multimodal dialogue interaction, further efficiency gains can be achieved by designing adaptive interfaces that fully exploit these synergies.
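As a rough illustration of the turn-duration breakdown and the per-turn mode selection described above, the following Python sketch splits each turn into inactivity and interaction time and picks the input mode with the lowest mean turn duration for a given context. The abstract does not state how efficiency is estimated, so the mean-duration criterion, the `Turn` record, the `select_modality` function, and the context labels are all assumptions made for illustration, not the system's actual method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    """One logged dialogue turn (hypothetical record layout)."""
    context: str          # interaction context, e.g. the form field being filled
    modality: str         # "speech" or "visual"
    inactivity_s: float   # time before the user starts providing input
    interaction_s: float  # time spent actually providing input

    @property
    def duration_s(self) -> float:
        # Turn duration = inactivity time + interaction time.
        return self.inactivity_s + self.interaction_s


def select_modality(turns, context, default="speech"):
    """Return the modality with the lowest mean turn duration in `context`.

    Assumption: efficiency is approximated by mean turn duration.
    Falls back to `default` when no turns were logged for the context.
    """
    durations = {}
    for turn in turns:
        if turn.context == context:
            durations.setdefault(turn.modality, []).append(turn.duration_s)
    if not durations:
        return default
    return min(durations, key=lambda m: mean(durations[m]))


# Made-up log entries, used only to exercise the function.
log = [
    Turn("city", "speech", 2.0, 3.0),
    Turn("city", "visual", 1.0, 7.0),
    Turn("date", "speech", 2.0, 6.0),
    Turn("date", "visual", 1.0, 2.0),
]
```

With this made-up log, `select_modality(log, "city")` favours speech and `select_modality(log, "date")` favours the visual mode, which mirrors the idea of switching the default input mode from turn to turn.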