
Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces, 2002

Date: 16 Oct. 2002


Displaying Results 1 - 25 of 89
  • Proceedings Fourth IEEE International Conference on Multimodal Interfaces

  • Author index

    Page(s): 541 - 543
  • Covariance-tied clustering method in speaker identification

    Page(s): 81 - 84

    Gaussian mixture models (GMMs) have been successfully applied as classifiers for speaker modeling in speaker identification. However, problems remain to be solved, such as the choice of clustering method. The conventional k-means algorithm uses Euclidean distance, which treats the data distribution as spherical, an assumption that does not match the actual data. In this paper we present a new method that uses covariance information to direct the clustering of GMMs, namely covariance-tied clustering. The method consists of two parts: obtaining covariance matrices using a data-sharing technique based on a binary tree, and using these covariance matrices to direct clustering. Experimental results show that the method yields worthwhile reductions in error rate for speaker identification. Much remains to be done to exploit the covariance information fully.
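
    As a rough illustration of letting covariance information direct the clustering (rather than the spherical assumption behind plain Euclidean k-means), the sketch below runs a k-means-style loop with per-cluster Mahalanobis distances. It is our own reading of the abstract, not the authors' algorithm; the binary-tree data-sharing step is omitted.

```python
# Illustrative sketch only: k-means-style clustering that uses per-cluster
# covariance (Mahalanobis distance) instead of plain Euclidean distance.
import numpy as np

def covariance_directed_kmeans(X, k, n_iter=20, reg=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)].astype(float)  # initial centers
    covs = np.stack([np.eye(d)] * k)                               # start spherical
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assign each frame to the cluster with the smallest Mahalanobis distance.
        dists = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j] + reg * np.eye(d))
            dists[:, j] = np.einsum('ni,ij,nj->n', diff, inv, diff)
        labels = dists.argmin(axis=1)
        # Re-estimate means and covariances from the assigned data.
        for j in range(k):
            members = X[labels == j]
            if len(members) > d:                                   # enough data
                means[j] = members.mean(axis=0)
                covs[j] = np.cov(members, rowvar=False)
    return means, covs, labels
```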

  • Individual differences in facial expression: stability over time, relation to self-reported emotion, and ability to inform person identification

    Page(s): 491 - 496

    The face can communicate varied personal information including subjective emotion, communicative intent, and cognitive appraisal. Accurate interpretation by an observer or a computer interface depends on attention to the dynamic properties of the expression, the context, and knowledge of what is normative for a given individual. In two separate studies, we investigated individual differences in the base rate of positive facial expression and in specific facial action units over intervals of 4 to 12 months. Facial expression was measured using convergent measures, including facial EMG, automatic feature-point tracking, and manual FACS coding. Individual differences in facial expression were stable over time, comparable in magnitude to the stability of self-reported emotion, and sufficiently strong that individuals could be recognized from their facial behavior alone at rates comparable to those of a commercial face recognition system (FaceIt from Identix). Facial action units thus convey unique information about person identity that can inform the interpretation of psychological states, person recognition, and the design of individuated avatars.

  • Context-based multimodal input understanding in conversational systems

    Page(s): 87 - 92

    In a multimodal human-machine conversation, user inputs are often abbreviated or imprecise. Sometimes, merely fusing multimodal inputs together cannot derive a complete understanding. To address these inadequacies, we are building a semantics-based multimodal interpretation framework called MIND (Multimodal Interpretation for Natural Dialog). The unique feature of MIND is the use of a variety of contexts (e.g., domain context and conversation context) to enhance multimodal fusion. In this paper we present a semantically rich modeling scheme and a context-based approach that enable MIND to gain a full understanding of user inputs, including ambiguous and incomplete ones.

  • Towards vision-based 3-D people tracking in a smart room

    Page(s): 400 - 405

    This paper presents our work on building a real-time distributed system that tracks the 3D locations of people in an indoor environment, such as a smart room, using multiple calibrated cameras. In our system, each camera is connected to a dedicated computer on which foreground regions in the camera image are detected using an adaptive background model. The detected foreground regions are broadcast to a tracking agent, which computes believed 3D locations of persons from the detected image regions. We have implemented both a best-hypothesis heuristic tracking approach and a probabilistic multi-hypothesis tracker to derive object tracks from these 3D locations. The two tracking approaches are evaluated on a sequence of two people walking in a conference room, recorded with three cameras. The results suggest that the probabilistic tracker performs comparably to the heuristic tracker.
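
    The abstract does not specify the adaptive background model; a minimal per-pixel running-average sketch of the foreground detection step might look as follows (the authors' model may well be more elaborate, e.g. per-pixel Gaussians).

```python
# Minimal sketch of adaptive background subtraction for foreground detection,
# in the spirit of the per-camera processing described above; illustrative only.
import numpy as np

class AdaptiveBackground:
    def __init__(self, first_frame, alpha=0.02, threshold=30.0):
        self.bg = first_frame.astype(np.float32)   # running background estimate
        self.alpha = alpha                         # adaptation rate
        self.threshold = threshold                 # foreground decision threshold

    def apply(self, frame):
        frame = frame.astype(np.float32)
        diff = np.abs(frame - self.bg)
        if frame.ndim == 3:                        # color frame (H, W, C)
            foreground = diff.max(axis=-1) > self.threshold
            mask = ~foreground[..., None]
        else:                                      # grayscale frame (H, W)
            foreground = diff > self.threshold
            mask = ~foreground
        # Update the background only where no foreground was detected,
        # so moving people are not absorbed into the model.
        self.bg = np.where(mask,
                           (1 - self.alpha) * self.bg + self.alpha * frame,
                           self.bg)
        return foreground
```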

  • Training a talking head

    Page(s): 499 - 504

    A Cyberware laser scan of DWM was made, Baldi's generic morphology was mapped onto the form of DWM, this head was trained on real data recorded with Optotrak LED markers, and the quality of its speech was evaluated. Participants were asked to recognize auditory sentences presented alone in noise, aligned with the newly trained synthetic texture-mapped target face, or aligned with the original natural face. There was a significant advantage when the noisy auditory sentence was paired with either head, with the synthetic texture-mapped target face giving as much of an improvement as the original recordings of the natural face.

  • 3D N-best search for simultaneous recognition of distant-talking speech of multiple talkers

    Page(s): 59 - 63

    A microphone array is a promising solution for realizing hands-free speech recognition in real environments. Accurate talker localization is very important for speech recognition with a microphone array, but localizing a moving talker is difficult in noisy, reverberant environments, and localization errors degrade recognition performance. To address this problem, we previously proposed a speech recognition algorithm that considers multiple talker-direction hypotheses simultaneously, performing a Viterbi search in a 3-dimensional trellis space composed of talker directions, input frames, and HMM states. In this paper we describe a new algorithm for simultaneous recognition of distant-talking speech from multiple talkers using an extended 3D N-best search. The algorithm incorporates path distance-based clustering and a likelihood normalization technique, which proved necessary for building an efficient system for our purpose. We evaluated the proposed method using reverberated data, both simulated by the image method and recorded in a real room. The image method was used to examine the relationship between accuracy and reverberation time, and the real recordings were used to evaluate the practical performance of the algorithm. The top-3 simultaneous word accuracy was 73.02% at a reverberation time of 162 ms using the image method.
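
    To make the 3-dimensional trellis concrete, the following sketch performs a plain Viterbi search over (frame, talker direction, HMM state) for a single hypothesis; the actual 3D N-best search keeps multiple ranked hypotheses, handles several talkers at once, and adds the clustering and normalization steps mentioned above. The array shapes and the direction-change penalty are our assumptions.

```python
# Sketch of a single-hypothesis Viterbi search over a 3-D trellis of
# (frame, candidate talker direction, HMM state); illustrative only.
import numpy as np

def viterbi_3d(log_emis, log_trans, dir_change_penalty=-2.0):
    """log_emis: (T, D, S) emission log-likelihoods per frame, direction, state.
    log_trans: (S, S) HMM state transition log-probabilities."""
    T, D, S = log_emis.shape
    NEG = -1e30
    score = np.full((T, D, S), NEG)
    score[0, :, 0] = log_emis[0, :, 0]            # start in the first HMM state
    back = np.zeros((T, D, S, 2), dtype=int)      # (previous direction, previous state)
    for t in range(1, T):
        for d in range(D):
            for s in range(S):
                best, arg = NEG, (0, 0)
                for dp in range(D):
                    pen = 0.0 if dp == d else dir_change_penalty
                    for sp in range(S):
                        cand = score[t - 1, dp, sp] + log_trans[sp, s] + pen
                        if cand > best:
                            best, arg = cand, (dp, sp)
                score[t, d, s] = best + log_emis[t, d, s]
                back[t, d, s] = arg
    # Trace back the best path of (direction, state) pairs.
    d, s = np.unravel_index(score[-1].argmax(), (D, S))
    path = [(int(d), int(s))]
    for t in range(T - 1, 0, -1):
        d, s = back[t, d, s]
        path.append((int(d), int(s)))
    return list(reversed(path)), float(score[-1].max())
```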

  • Experimentally augmenting an intelligent tutoring system with human-supplied capabilities: adding human-provided emotional scaffolding to an automated reading tutor that listens

    Page(s): 483 - 490

    We present the first statistically reliable empirical evidence from a controlled study for the effect of human-provided emotional scaffolding on student persistence in an intelligent tutoring system. We describe an experiment that added human-provided emotional scaffolding to an automated Reading Tutor that listens, and discuss the methodology we developed to conduct this experiment. Each student participated in one (experimental) session with emotional scaffolding and in one (control) session without emotional scaffolding, counterbalanced by order of session. Each session was divided into several portions; after each portion was completed, the Reading Tutor gave the student a choice: continue or quit. We measured persistence as the number of portions the student completed. Human-provided emotional scaffolding added to the automated Reading Tutor resulted in increased student persistence compared to the Reading Tutor alone. Increased persistence means increased time on task, which ought to lead to improved learning. If these results for reading hold for other domains as well, the implication for intelligent tutoring systems is that they should respond not just with cognitive support but with emotional scaffolding as well. Furthermore, the general technique of adding human-supplied capabilities to an existing intelligent tutoring system should prove useful for studying other ITSs too.

  • Robust noisy speech recognition with adaptive frequency bank selection

    Page(s): 75 - 80

    With the development of automatic speech recognition technology, the robustness of speech recognition systems is becoming more and more important. This paper addresses the problem of speech recognition in additive background noise. Since the frequency energy of different types of noise is concentrated in different frequency banks, additive noise affects each frequency bank differently. Seriously obscured frequency banks retain little useful speech information and are harmful to subsequent speech processing. Wu and Lin (2000) applied frequency bank selection to robust word boundary detection in noisy environments and obtained good detection results. In this paper, the approach is extended to noisy speech recognition. Unlike standard MFCCs, which use all frequency banks for the cepstral coefficients, we use only the frequency banks that are slightly corrupted and discard the seriously obscured ones; cepstral coefficients are calculated only on the selected frequency banks. Moreover, the acoustic model is adapted to match the modified acoustic features. Experiments on continuous digit speech recognition show that the proposed algorithm outperforms spectral subtraction and cepstral mean normalization at low SNRs.
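
    A minimal sketch of the selection idea, under the assumption that per-bank log energies and a noise estimate are already available: keep only the banks whose estimated SNR clears a threshold and take a DCT over the surviving log energies. The threshold and the SNR estimate are illustrative, not the paper's exact procedure.

```python
# Illustrative sketch: cepstral coefficients computed only on the frequency
# banks that are not seriously obscured by noise.
import numpy as np

def selected_bank_cepstrum(log_bank_energy, log_noise_energy,
                           snr_threshold_db=3.0, n_ceps=12):
    """log_bank_energy, log_noise_energy: (n_banks,) natural-log energies of the
    current frame and of a noise estimate (e.g. taken from leading silence)."""
    snr_db = 10.0 / np.log(10.0) * (log_bank_energy - log_noise_energy)
    keep = snr_db > snr_threshold_db          # discard heavily obscured banks
    x = log_bank_energy[keep]
    n = len(x)
    if n == 0:
        return np.zeros(n_ceps)               # every bank was obscured
    # DCT-II over the selected banks only.
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ x
```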

  • Context-sensitive help for multimodal dialogue

    Page(s): 93 - 98

    Multimodal interfaces offer users unprecedented flexibility in choosing a style of interaction. However, users are frequently unaware of or forget shorter or more effective multimodal or pen-based commands. This paper describes a working help system that leverages the capabilities of a multimodal interface in order to provide targeted, unobtrusive, context-sensitive help. This multimodal help system guides the user to the most effective way to specify a request, providing transferable knowledge that can be used in future requests without repeatedly invoking the help system.

  • Using TouchPad pressure to detect negative affect

    Page(s): 406 - 410

    Humans naturally use behavioral cues in their interactions with other humans. The Media Equation proposes that these same cues are directed towards media, including computers. It is probable that detection of these cues by a computer at run time could improve usability design and analysis. A preliminary experiment testing one of these cues, Synaptics TouchPad pressure, shows that behavioral cues can be used as a critical-incident indicator by detecting negative affect.
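
    As a toy illustration of pressure as a critical-incident indicator, one could flag samples where pressure rises far above the user's own baseline; the window length and threshold below are our assumptions, not values from the study.

```python
# Toy sketch: flag possible negative-affect incidents when TouchPad pressure
# deviates strongly from the user's own baseline. Illustrative only.
import numpy as np

def pressure_incidents(pressure, baseline_samples=200, z_threshold=3.0):
    """pressure: 1-D array of TouchPad pressure samples for one user."""
    base = pressure[:baseline_samples]           # calibrate on early, calm input
    mu, sigma = base.mean(), base.std() + 1e-9
    z = (pressure - mu) / sigma
    return np.flatnonzero(z > z_threshold)       # indices of candidate incidents
```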

  • Towards visually-grounded spoken language acquisition

    Page(s): 105 - 110

    A characteristic shared by most approaches to natural language understanding and generation is the use of symbolic representations of word and sentence meanings. Frames and semantic nets are examples of symbolic representations. Symbolic methods are inappropriate for applications which require natural language semantics to be linked to perception, as is the case in tasks such as scene description or human-robot interaction. This paper presents two implemented systems, one that learns to generate, and one that learns to understand visually-grounded spoken language. These implementations are part of our on-going effort to develop a comprehensive model of perceptually-grounded semantics.

  • Labial coarticulation modeling for realistic facial animation

    Page(s): 505 - 510

    A modified version of the coarticulation model proposed by Cohen and Massaro (1993) is described. A semi-automatic minimization technique, working on real kinematic data acquired with the ELITE opto-electronic system, was used to train the dynamic characteristics of the model. Finally, the model was applied successfully to GRETA, an Italian talking head, and examples are presented to illustrate the naturalness of the resulting animation.
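
    In outline, a Cohen-Massaro style coarticulation model blends per-segment viseme targets with dominance functions that decay away from each segment's center; the sketch below shows this weighted-average rule with generic parameter names. It is our paraphrase of the general model, not the trained GRETA parameters or the authors' modification.

```python
# Sketch of a dominance-function coarticulation rule: the lip parameter at
# time t is the dominance-weighted average of the per-segment targets.
import numpy as np

def lip_parameter(t, targets, centers, alphas, thetas, c=1.0):
    """targets, centers, alphas, thetas: per-segment arrays of equal length.
    t may be a scalar or an array of time instants."""
    t = np.atleast_1d(t)[:, None]
    dom = alphas * np.exp(-thetas * np.abs(t - centers) ** c)   # dominance of each segment
    return (dom * targets).sum(axis=1) / dom.sum(axis=1)        # weighted blend
```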

  • Active gaze tracking for human-robot interaction

    Page(s): 261 - 266

    In our effort to make human-robot interfaces more user-friendly, we built an active gaze tracking system that measures a person's gaze direction in real time. Gaze normally indicates which object in the surroundings a person is interested in, so it can serve as a medium for human-robot interaction, for example instructing a robot arm to pick up the object a user is looking at. We discuss how we developed and combined algorithms for zoom camera calibration, low-level control of an active head, face tracking, and gaze tracking to create an active gaze tracking system.

  • Multimodal dialogue systems for interactive TV applications

    Page(s): 117 - 122

    Many studies have shown the advantages of building multimodal systems, but not in the interactive TV application context. This paper reports on a qualitative study of a multimodal program guide for interactive TV. The system was designed by adding speech interaction to an existing TV program guide. Results indicate that spoken natural language input combined with visual output is preferable for TV applications. Furthermore, user feedback requires a clear distinction between the dialogue system's domain result and system status in the visual output. Consequently, we propose an interaction model that consists of three entities: user, domain results, and system feedback.

  • Improved named entity translation and bilingual named entity extraction

    Page(s): 253 - 258

    Translation of named entities (NEs), including proper names and temporal and numerical expressions, is very important in multilingual natural language processing tasks such as crosslingual information retrieval and statistical machine translation. We present an integrated approach that extracts a named entity translation dictionary from a bilingual corpus while at the same time improving the named entity annotation quality. Starting from a bilingual corpus in which the named entities are extracted independently for each language, a statistical alignment model is used to align the named entities. An iterative process then extracts named entity pairs with higher alignment probability. This yields a smaller but cleaner named entity translation dictionary and a significant improvement in the monolingual named entity annotation quality for both languages. Experimental results show that the dictionary size is reduced by 51.8% and the annotation quality improves, in terms of F-score, from 70.03 to 78.15 for Chinese and from 73.38 to 81.46 for the other language.
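
    Schematically, the extraction loop can be pictured as below, with the statistical alignment model abstracted behind an assumed align_prob function; the threshold and relaxation schedule are illustrative only, not the authors' settings.

```python
# Schematic sketch of iterative bilingual NE-pair extraction: keep pairs whose
# alignment probability is high and grow the translation dictionary.
def extract_ne_dictionary(sentence_pairs, align_prob, threshold=0.9, rounds=3):
    """sentence_pairs: list of (source_NEs, target_NEs) per aligned sentence pair.
    align_prob(src_ne, tgt_ne): alignment probability from a trained model (assumed)."""
    dictionary = set()
    for _ in range(rounds):
        for src_nes, tgt_nes in sentence_pairs:
            for s in src_nes:
                # Pick the best-scoring target NE for this source NE.
                best = max(tgt_nes, key=lambda t: align_prob(s, t), default=None)
                if best is not None and align_prob(s, best) >= threshold:
                    dictionary.add((s, best))
        threshold *= 0.95   # relax slightly between iterations (illustrative)
    return dictionary
```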

  • Integration of tone related feature for Chinese speech recognition

    Page(s): 64 - 68

    Chinese is a tonal language that uses fundamental frequency, in addition to phones, for word differentiation. Commonly used front-end features such as mel-frequency cepstral coefficients (MFCCs), however, are optimized for non-tonal languages such as English and do not focus on the pitch information that is important for tone identification. In this paper, we examine the integration of tone-related acoustic features for Chinese speech recognition. We propose using the cepstrum method (CEP), which shares the configuration used for MFCC extraction, to extract pitch-related features. The pitch periods extracted by the CEP algorithm can be used directly for speech recognition and require no special treatment for unvoiced frames. In addition, we explore a number of feature transformations and find that adding a properly normalized and transformed set of pitch-related features reduces the recognition error rate from 34.61% to 29.45% on the Chinese 1998 National Performance Assessment (Project 863) corpus.
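
    A minimal sketch of a cepstrum (CEP) pitch estimator of the kind referred to above: take the real cepstrum of a windowed frame and pick the strongest peak within a plausible pitch range. The window choice and search range are our assumptions, not the paper's configuration.

```python
# Sketch of pitch-period estimation via the real cepstrum. Illustrative only.
import numpy as np

def cepstral_pitch_period(frame, sample_rate, fmin=60.0, fmax=400.0):
    """frame: 1-D array of speech samples; returns the pitch period in seconds."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    lo = int(sample_rate / fmax)                      # shortest plausible period
    hi = min(int(sample_rate / fmin), len(cepstrum) - 1)  # longest plausible period
    peak = lo + int(np.argmax(cepstrum[lo:hi]))       # strongest quefrency peak
    return peak / sample_rate
```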

  • Hand tracking using spatial gesture modeling and visual feedback for a virtual DJ system

    Page(s): 197 - 202

    The ability to accurately track hand movement provides new opportunities for human-computer interaction (HCI). Many of today's commercial glove-based hand tracking devices can be cumbersome and expensive. An approach that avoids these problems is to use computer vision to capture hand motion. We present a complete real-time hand tracking and 3-D modeling system based on a single camera. In our system, we extract feature points from a video stream of a hand to control a virtual hand model with 2-D global motion and 3-D local motion. The on-screen model gives the user instant feedback on the estimated position of the hand, and this visual feedback allows the user to compensate for tracking errors. The system is used for three example applications: the first uses hand tracking and gestures to take on the role of the mouse, the second interacts with a 3D virtual environment using the 3D hand model, and the last is a virtual DJ system controlled by hand motion tracking and gestures.

  • Embarking on multimodal interface design

    Page(s): 355 - 360

    Designers are increasingly faced with the challenge of targeting multimodal applications, those that span heterogeneous devices and use multimodal input, but do not have tools to support them. We studied the early stage work practices of professional multimodal interaction designers. We noted the variety of different artifacts produced, such as design sketches and paper prototypes. Additionally, we observed Wizard of Oz techniques that are sometimes used to simulate an interactive application from these sketches. These studies have led to our development of a technique for interface designers to consider as they embark on creating multimodal applications.

  • The NESPOLE! multimodal interface for cross-lingual communication: experience and lessons learned

    Page(s): 223 - 228

    We describe the design, evolution, and development of the user interface components of the NESPOLE! speech-to-speech translation system. The NESPOLE! system was designed for users with medium-to-low levels of computer literacy and Web expertise. The user interface was designed to effectively combine Web browsing, real-time sharing of graphical information and multi-modal annotations using a shared whiteboard, and real-time multilingual speech communication, all within an e-commerce scenario. Data collected in sessions with naive users in several stages in the process of system development formed the basis for improving the effectiveness and usability of the system. We describe this development process, the resulting interface components and the lessons learned.

  • Referring to objects with spoken and haptic modalities

    Page(s): 99 - 104

    The gesture input modality considered in multimodal dialogue systems is mainly reduced to pointing or manipulation actions. With an approach based on the spontaneous character of communication, the treatment of such actions involves many processes. Without constraints, the user may use gesture in association with speech and may exploit peculiarities of the visual context that guide the articulation of gesture trajectories and the choice of words. Semantic interpretation of multimodal utterances then becomes a complex problem that must take into account varieties of referring expressions, varieties of gestural trajectories, structural parameters of the visual context, and directives from a specific task. Following this spontaneous approach, we propose to give dialogue systems maximal understanding capabilities, ensuring that the various interaction modes are taken into account. Considering the development of haptic devices (such as the PHANToM), which add tactile and kinesthetic sensation, we propose to explore a new domain of research concerning the integration of haptic gesture into multimodal dialogue systems, in terms of its possible associations with speech for object reference and manipulation. We focus on the compatibility between haptic gesture and multimodal reference models, and on the consequences of processing this new modality for intelligent system architectures, studied here from a semantic point of view.

  • Multimodal contextual car-driver interface

    Page(s): 367 - 373

    This paper focuses on the design and implementation of a companion contextual car-driver interface that proactively assists the driver in managing information and communication. The prototype combines a smart car environment and driver state monitoring, incorporating a wide range of input-output modalities and a display hierarchy. Intelligent agents link information from many contexts, such as location and schedule, and transparently learn from the driver, interacting with the driver only when necessary.

  • Designing transition networks for multimodal VR-interactions using a markup language

    Page(s): 411 - 416

    This article presents one core component for enabling multimodal (speech- and gesture-driven) interaction in and for virtual environments. A so-called temporal Augmented Transition Network (tATN) is introduced; it integrates and evaluates information from speech, gesture, and a given application context using a combined syntactic/semantic parse approach. The tATN is the target structure for a multimodal integration markup language (MIML). MIML centers on the specification of multimodal interactions by letting an application designer declare temporal and semantic relations between given input utterance percepts and certain application states in a declarative and portable manner. A subsequent parse pass translates MIML into corresponding tATNs, which are directly loaded and executed by a simulation engine's scripting facility.
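
    As a very small illustration of the tATN idea (not the MIML syntax or the authors' engine), the sketch below fires transitions on named percepts subject to a temporal constraint relative to the previous step.

```python
# Toy temporal transition network: transitions fire on a named percept and an
# optional maximum time gap since the previous step. Illustrative only.
import time

class TemporalTransitionNetwork:
    def __init__(self, start, transitions):
        """transitions: dict mapping (state, percept) -> (next_state, max_gap),
        where max_gap is the maximum allowed seconds since the last step,
        or None for no temporal constraint."""
        self.state = start
        self.transitions = transitions
        self.last_step = time.monotonic()

    def feed(self, percept):
        key = (self.state, percept)
        if key not in self.transitions:
            return False
        next_state, max_gap = self.transitions[key]
        now = time.monotonic()
        if max_gap is not None and now - self.last_step > max_gap:
            return False                    # temporal constraint violated
        self.state, self.last_step = next_state, now
        return True

# Example: "put <pointing gesture within 2 s> that there"
net = TemporalTransitionNetwork(
    start="idle",
    transitions={
        ("idle", "speech:put"): ("await_object", None),
        ("await_object", "gesture:point"): ("await_target", 2.0),
        ("await_target", "speech:there"): ("done", 4.0),
    },
)
```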

  • Modeling output in the EMBASSI multimodal dialog system

    Page(s): 111 - 116

    In this paper we present a concept for the abstract modeling of output render components. We illustrate how this categorization serves to seamlessly and coherently integrate previously unknown output modalities into the multimodal presentations of the EMBASSI dialog system. We present a case study and conclude with an overview of related work.
