Historically, the development of computer interfaces has been a technology-driven phenomenon. However, new multimodal interfaces are composed of recognition-based technologies that must interpret human speech, gesture, gaze, movement patterns, and other complex natural behaviors, which involve highly automatized skills that are not under full conscious control. As a result, it is now widely acknowledged that multimodal interface design requires modeling of the modality-centered behavior and integration patterns upon which multimodal systems aim to build. This paper summarizes research on the cognitive science foundations of multimodal interaction, and on the essential role that user-centered modeling has played in prototyping, guiding, and evaluating the design of next-generation multimodal interfaces. In particular, it discusses the properties of different modalities and the information content they carry, the unique features of multimodal language and its processability, when users are likely to interact multimodally, and how their multimodal input is integrated and synchronized. It also reviews research on the typical performance and linguistic efficiencies associated with multimodal interaction, and on the user-centered reasons why multimodal interaction minimizes errors and expedites error handling. In addition, this paper describes the important role that selective methodologies and evaluation metrics have played in shaping next-generation multimodal systems, and it concludes by highlighting future directions for designing a new class of adaptive multimodal-multisensor interfaces.