
IEEE 9th Workshop on Multimedia Signal Processing (MMSP 2007)

Date: 1-3 Oct. 2007

Displaying Results 1 - 25 of 122
  • [Front cover]

    Page(s): C1
  • [Breaker page]

    Page(s): i
  • [Breaker page]

    Page(s): ii
  • Contributors

    Page(s): iii - iv
  • Table of contents

    Page(s): v - xvi
  • Facial Features Tracking for Gross Head Movement Analysis and Expression Recognition

    Page(s): 2

    Summary form only given. The tracking and recognition of facial expressions from a single camera is an important and challenging problem. We present a real-time framework for action unit (AU)/expression recognition based on facial feature tracking and AdaBoost. Accurate facial feature tracking is challenging due to changes in illumination, skin color variations, possible large head rotations, partial occlusions, and fast head movements. We use active shape models to localize facial features on the face in a generic pose. The shapes of facial features undergo nonlinear transformations as the head rotates from frontal view to profile view. We learn the nonlinear shape manifold as multiple overlapping subspaces, with different subspaces representing different head poses. Face alignment is done by searching over the nonlinear shape manifold and aligning the landmark points to the feature boundaries. The localized features are tracked across multiple frames using a KLT tracker by constraining the shape to lie on the nonlinear manifold. Our tracking framework has been successfully used for detecting gross head movements, such as nodding and shaking, and for head pose prediction. Further, we use the tracked features to accurately extract bounded faces in a video sequence and use them for recognizing facial expressions. Our approach is based on coded dynamic features. In order to capture the dynamic characteristics of facial events, we design dynamic Haar-like features to represent the temporal variations of facial events. Inspired by binary pattern coding, we further encode the dynamic Haar-like features into binary pattern features, which are useful for constructing weak classifiers for boosting learning. Finally, AdaBoost is used to learn a set of discriminating coded dynamic features for facial action unit and expression recognition. We have achieved a detection rate of approximately 97% for gross head movements such as shaking and nodding. The recognition rate for facial expressions averages approximately 95% for the most important action units.
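
    As a concrete illustration of the coded-dynamic-feature idea in this abstract, the sketch below encodes the frame-to-frame variation of Haar-like feature responses into binary-pattern codes and feeds them to AdaBoost. It is a minimal reading of the abstract, not the authors' implementation; the window length, the increase/decrease coding rule, and the synthetic data are all assumptions.

        # Sketch: binary-pattern coding of dynamic Haar-like features + AdaBoost.
        # Hypothetical reconstruction of the idea; not the authors' code.
        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier

        def dynamic_binary_code(haar_responses):
            """haar_responses: (T, F) Haar-like feature values over T frames.
            Encode each feature's temporal variation as one bit per frame pair
            (1 if the response increased), then pack the bits into an integer."""
            bits = (np.diff(haar_responses, axis=0) > 0).astype(np.uint8)  # (T-1, F)
            weights = 1 << np.arange(bits.shape[0])[:, None]
            return (bits * weights).sum(axis=0)                            # (F,)

        # Toy usage: 200 clips, 8 frames, 32 Haar-like features, binary AU label.
        rng = np.random.default_rng(0)
        X = np.stack([dynamic_binary_code(rng.normal(size=(8, 32))) for _ in range(200)])
        y = rng.integers(0, 2, size=200)
        clf = AdaBoostClassifier(n_estimators=50).fit(X, y)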

  • Multimedia Technologies and Solutions for Educational Applications: Opportunities, Trends and Challenges

    Page(s): 3 - 8

    This report provides an overview of multimedia technologies in education, particularly language technologies and applications aimed at children. Emphasis is on those aspects that may not necessarily be familiar to engineers, such as linguistics and language pedagogy.

  • State of the Art and Future Directions in Musical Sound Synthesis

    Page(s): 9 - 12

    Sound synthesis and processing has been the most active research topic in the field of sound and music computing for more than 40 years. Quite a number of the early research results are now standard components of many audio and music devices, and new technologies are continuously being developed and integrated into new products. Through the years there have been important changes. For example, most of the abstract algorithms that were the focus of work in the 1970s and 1980s are now considered obsolete. The 1990s then saw the emergence of computational approaches that aim either at capturing the characteristics of a sound source, known as physical models, or at capturing the perceptual characteristics of the sound signal, generally referred to as spectral or signal models. More recent trends include the combination of physical and spectral models and corpus-based concatenative methods. But the field faces major challenges that might revolutionize the standard paradigms and applications of sound synthesis. In this article, we first place the sound synthesis topic within its research context, then highlight some of the current trends, and finally attempt to identify some challenges for the future.

  • Sensor Networks for Ambient Intelligence

    Page(s): 13 - 16

    Due to rapid advances in networking and sensing technology, we are witnessing a growing interest in sensor networks, in which a variety of sensors are connected to each other and to computational devices capable of multimodal signal processing and data analysis. Such networks are seen to play an increasingly important role as key enablers in emerging pervasive computing technologies. In the first part of this paper, we give an overview of recent developments in the area of multimodal sensor networks, paying special attention to ambient intelligence applications. In the second part, we discuss how the time series generated by the data streams emanating from the sensors can be mined for temporal patterns that indicate cross-sensor signal correlations.

  • Recent advances in brain-computer interfaces

    Page(s): 17

    A brain-computer interface (BCI) is a communication system that translates brain activity into commands for a computer or other devices. In other words, a BCI allows users to act on their environment by using only brain activity, without using peripheral nerves and muscles. The major goal of BCI research is to develop systems that allow disabled users to communicate with other persons, to control artificial limbs, or to control their environment. To achieve this goal, many aspects of BCI systems are currently being investigated. Research areas include the evaluation of invasive and noninvasive technologies to measure brain activity, the evaluation of control signals (i.e., patterns of brain activity that can be used for communication), the development of algorithms for translating brain signals into computer commands, and the development of new BCI applications. In this paper, we give an overview of these aspects of BCI research and highlight recent developments and open problems.

  • Enhancing Social Communication in High-Functioning Children with Autism through a Co-Located Interface

    Page(s): 18 - 21

    In this paper we describe a pilot study of an intervention aimed at enhancing social skills in high-functioning children with autism. We found initial evidence that the use of the interface encourages social interaction and may lessen the repetitive behaviors typical of autism. These positive effects also appear to transfer to other tasks following the intervention. We hypothesize that the effect is due to some unique characteristics of the interfaces used, in particular enforcing that some tasks be done together through the use of multiple-user GUI actions.

  • A review of the acoustic and linguistic properties of children's speech

    Page(s): 22 - 25

    In this paper, we review the acoustic and linguistic properties of children's speech for both read and spontaneous speech. First, the effect of developmental changes on the absolute values and variability of acoustic correlates is presented for read speech for children ages 6 and up. Then, verbal child-machine spontaneous interaction is reviewed and results from recent studies are presented. Age trends of acoustic, linguistic and interaction parameters are discussed, such as sentence duration, filled pauses, politeness and frustration markers, and modality usage. Some differences between child-machine and human-human interaction are pointed out. The implications for acoustic modeling, linguistic modeling and spoken dialogue systems design for children are discussed.

  • A System for Technology Based Assessment of Language and Literacy in Young Children: the Role of Multiple Information Sources

    Page(s): 26 - 30

    This paper describes the design and realization of an automatic system for assessing and evaluating the language and literacy skills of young children. The system was developed in the context of the TBALL (Technology-Based Assessment of Language and Literacy) project and aims at automatically assessing the English literacy skills of both native speakers of American English and Mexican-American children in grades K-2. The automatic assessments were carried out employing appropriate speech recognition and understanding techniques. In this paper, we describe the system with a focus on the role of the multiple sources of information at our disposal. We present the content of the assessment system and discuss some issues in creating a child-friendly interface and in providing suitable feedback to the teachers. In addition, we discuss the different assessment modules and the algorithms used for speech analysis.

  • Perceptual Enhancement for Fully Scalable Audio

    Page(s): 31 - 34

    MPEG-4 scalable lossless (SLS) coding is the most recently released ISO international standard for scalable audio coding. Besides its function as an extension of the MPEG-4 advanced audio coding (AAC) perceptual audio coder, SLS has a "non-core mode" that is able to offer full scalability. The perceptual audio coder is absent in this mode, and scalability is achieved through pure bit-plane coding. In this paper, a perceptually enhanced bit-plane coding method, namely quad-level bit-plane coding (QBPC), is proposed to enhance the perceptual quality of fully scalable audio at intermediate bitrates. With the QBPC structure, the perceptual quality of fully scalable audio coded by SLS is significantly improved over a wide range of intermediate bitrates, and this is achieved with negligible added overhead and complexity.
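
    The pure bit-plane coding that underlies the non-core mode can be illustrated directly: integer coefficients are sent most-significant plane first, so a decoder that stops after any plane still reconstructs a coarser version of the signal. The sketch below shows this scanning principle only; it does not reproduce SLS entropy coding or the proposed QBPC.

        # Illustrative bit-plane scan (not the SLS/QBPC codec itself).
        import numpy as np

        def bitplane_encode(coeffs, n_planes):
            signs = np.sign(coeffs)
            mags = np.abs(coeffs).astype(np.uint32)
            planes = [((mags >> p) & 1) for p in range(n_planes - 1, -1, -1)]
            return signs, planes                     # planes[0] is the MSB plane

        def bitplane_decode(signs, planes, keep):
            """Reconstruct using only the first `keep` (most significant) planes."""
            n = len(planes)
            mags = np.zeros_like(planes[0], dtype=np.uint32)
            for i in range(keep):
                mags |= planes[i].astype(np.uint32) << (n - 1 - i)
            return signs * mags.astype(np.int64)

        c = np.array([37, -5, 12, 0, -25])
        s, p = bitplane_encode(c, n_planes=6)
        print(bitplane_decode(s, p, keep=3))         # coarse: top 3 planes only
        print(bitplane_decode(s, p, keep=6))         # lossless: all planes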

  • Real-Time Continuous Speech Recognition System on SH-4A Microprocessor

    Page(s): 35 - 38

    To bring CSR (continuous speech recognition) software to mobile environments, we have developed an embedded version of Julius (embedded Julius). Julius is open-source CSR software that has been used by many researchers and developers in Japan as a standard decoder on PCs. In this paper, we describe an implementation of the embedded Julius on an SH-4A microprocessor. The SH-4A is a high-end 32-bit MPU (720 MIPS) with an on-chip FPU. However, further computational reduction is necessary for the embedded Julius to operate in real time. Applying several optimizations, the embedded Julius achieves real-time processing on the SH-4A. The experimental results show 0.89 times real time (RT), 4.0 times faster than the baseline CSR. We also evaluated the embedded Julius on a large vocabulary (20,000 words), where it shows nearly real-time processing (1.25 times RT).

  • Impact of Additional Noise on Subjective and Objective Quality Assessment in VoIP

    Page(s): 39 - 42

    The main requirement in Voice over IP (VoIP) technology is good quality of the received voice signal during communication between subscribers. The signal quality can be influenced by many factors, such as packet loss, jitter, packet delay, and noise, and it can be measured by a number of methods. The main purpose of this paper is to investigate the impact of different noise types and noise levels on quality assessment in VoIP. Artificially generated noises and real noises obtained from real telecommunications networks were used for testing. A further goal is a comparison of the results obtained by subjective listening tests and by objective measuring methods; PESQ and 3SQM were used for the objective testing in this paper.

  • Joint Analysis of the Emotional Fingerprint in the Face and Speech: A Single Subject Study

    Page(s): 43 - 47

    In daily human interaction, speech and gestures are used to express an intended message, enriched with verbal and non-verbal information. Although many communicative goals are simultaneously encoded using the same modalities, such as the face or the voice, listeners are generally good at decoding each aspect of the message. This encoding process includes an underlying interplay between communicative goals and channels which is not yet well understood. In this direction, this paper explores the interplay between linguistic and affective goals in speech and facial expression. We hypothesize that when one modality is constrained by the articulatory speech process, other channels with more degrees of freedom are used to convey the emotions. The results presented here support this hypothesis: facial expression and speech prosody tend to show a stronger emotional modulation when the vocal tract is physically constrained by articulation to convey other linguistic communicative goals.

  • Real-time Emotion Detection System using Speech: Multi-modal Fusion of Different Timescale Features

    Page(s): 48 - 51

    The goal of this work is to build a real-time emotion detection system that utilizes multi-modal fusion of speech features at different timescales. Conventional spectral and prosody features are used for intra-frame and supra-frame features, respectively, and a new information fusion algorithm that takes into account the characteristics of each machine learning algorithm is introduced. In this framework, the proposed system can incorporate additional features, such as lexical or discourse information, in later steps. To verify real-time system performance, binary decision tasks on angry and neutral emotions are performed using concatenated speech signals simulating real-time conditions.
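
    One common way to realize such multi-timescale fusion is to train a separate classifier per timescale and combine their posteriors log-linearly; the sketch below does exactly that on synthetic stand-ins for spectral and prosody features. The learners, the fusion weight, and the data are all assumptions and do not reproduce the paper's fusion algorithm.

        # Hedged sketch: late log-linear fusion of an intra-frame (spectral)
        # and a supra-frame (prosody) classifier on synthetic data.
        import numpy as np
        from sklearn.naive_bayes import GaussianNB
        from sklearn.svm import SVC

        rng = np.random.default_rng(6)
        n = 400
        y = rng.integers(0, 2, size=n)                        # 0=neutral, 1=angry
        spec = rng.normal(y[:, None] * 0.8, 1.0, size=(n, 13))  # intra-frame proxy
        pros = rng.normal(y[:, None] * 1.2, 1.0, size=(n, 4))   # supra-frame proxy

        clf_s = GaussianNB().fit(spec[:300], y[:300])
        clf_p = SVC(probability=True).fit(pros[:300], y[:300])

        a = 0.4                                               # fusion weight (assumed)
        logp = (a * np.log(clf_s.predict_proba(spec[300:]) + 1e-9)
                + (1 - a) * np.log(clf_p.predict_proba(pros[300:]) + 1e-9))
        acc = (logp.argmax(1) == y[300:]).mean()
        print(f"fused accuracy: {acc:.2f}")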

  • Dual-Mode Wideband Speech Compression

    Page(s): 52 - 55

    Many bandwidth extension techniques attempt to predict the high-band frequencies based on features extracted from the lower band. Recent work suggests that such methods are limited because the correlation between the low band and the high band is insufficient for adequate representation. As a result, additional high-band information must be sent to the decoder. In this paper, we propose a dual-mode wideband speech coding algorithm based on the principles of bandwidth extension. The principal contributions include a mode selection algorithm, based on a greedy algorithm that maximizes a loudness criterion, and a bandwidth extension algorithm based on a constrained MMSE estimator. Results reveal that the proposed system improves the quality of narrowband speech while operating at a lower bit rate.
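
    To make the estimator concrete, the sketch below computes a plain linear MMSE (Wiener) estimate of high-band features from low-band features on synthetic data. It stands in for, but is not, the paper's constrained MMSE estimator; the feature dimensions and the jointly Gaussian assumption are illustrative.

        # Sketch: linear MMSE prediction of high-band features from the low band.
        import numpy as np

        rng = np.random.default_rng(1)
        # Synthetic training pairs: y = low-band features, x = high-band features.
        y = rng.normal(size=(1000, 8))
        x = y @ rng.normal(size=(8, 4)) * 0.5 + 0.1 * rng.normal(size=(1000, 4))

        my, mx = y.mean(0), x.mean(0)
        Cyy = np.cov(y, rowvar=False)
        Cxy = (x - mx).T @ (y - my) / (len(y) - 1)    # cross-covariance
        W = Cxy @ np.linalg.inv(Cyy)                  # MMSE gain matrix

        def estimate_highband(y_new):
            # E[x | y] under a jointly Gaussian model
            return mx + (y_new - my) @ W.T

        print(estimate_highband(y[:2]))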

  • Sound Source Localization by Asymmetrically Arrayed 2ch Microphones on a Sphere

    Page(s): 56 - 59

    In this paper, we propose a novel system to localize a sound source in any 2D direction using only two microphones. In our system, the two microphones are asymmetrically placed on a sphere; thus, (1) the diffraction by the sphere and the asymmetrical arrangement of the microphones yield localization cues, including the front-back judgment, and (2) unlike a dummy-head system, no prior measurements are necessary, owing to the analytical representation of the sphere's diffraction. To deal with reverberation and ambient noise, we consider maximum likelihood estimation of the direction of arrival with a diffuse noise model on a sphere. We present a real system, built through an investigation of the optimal microphone arrangement for speech, and give experimental results in a real environment.
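
    As background for the two-microphone setting, the sketch below estimates a time difference of arrival with GCC-PHAT in the free field and converts it to an angle. This generic free-field estimate has the very front-back ambiguity that the paper's sphere-diffraction model is designed to resolve; the sphere model itself is not reproduced here.

        # Generic two-microphone TDOA/DOA sketch via GCC-PHAT (free field).
        import numpy as np

        def gcc_phat_tdoa(x1, x2, fs):
            n = len(x1) + len(x2)
            X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
            G = X1 * np.conj(X2)
            G /= np.abs(G) + 1e-12                    # PHAT weighting
            cc = np.fft.irfft(G, n)
            max_shift = n // 2
            cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
            shift = np.argmax(cc) - max_shift
            return -shift / fs                        # delay of x2 relative to x1

        fs, c, d = 16000, 343.0, 0.15                 # Hz, m/s, mic spacing in m
        rng = np.random.default_rng(2)
        s = rng.normal(size=4096)
        x1, x2 = s, np.roll(s, 3)                     # x2 lags x1 by 3 samples
        tau = gcc_phat_tdoa(x1, x2, fs)
        theta = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
        print(f"TDOA {tau * 1e6:.0f} us -> angle {theta:.1f} deg")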

  • Multimodal Meeting Monitoring: Improvements on Speaker Tracking and Segmentation through a Modified Mixture Particle Filter

    Page(s): 60 - 65

    In this paper we address improvements to our multimodal system for tracking meeting participants and segmenting speakers, with a focus on the microphone array modality. We propose an algorithm that uses directions of arrival estimated for each microphone pair as observations and performs tracking of an unknown number of acoustically active meeting participants, followed by speaker segmentation. We propose a modified mixture particle filter (mMPF) for tracking acoustic sources in the track-before-detect (TbD) framework. Trajectories of sound sources are reconstructed by the optimal assignment of posterior mixture components produced by the mMPF in consecutive frames. Further, we propose a sequential optimal change-point detection algorithm that discovers speech segments in the reconstructed trajectories, i.e., performs speaker segmentation. The algorithm is tested on a multi-participant meeting dataset, both separately and as part of the multimodal system. On the task of speaker detection in the multimodal setup, we report a significant improvement over our previous state-of-the-art implementation.
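
    For readers unfamiliar with particle filtering, the sketch below tracks a single wandering DOA with a plain bootstrap filter; the paper's modified mixture particle filter extends this idea to an unknown number of sources within the TbD framework. The random-walk dynamics, Gaussian likelihood, and all constants are illustrative assumptions.

        # Bootstrap particle filter tracking one DOA trajectory (illustrative).
        import numpy as np

        rng = np.random.default_rng(3)
        n_p, sigma_dyn, sigma_obs = 500, 2.0, 8.0             # particles, degrees
        true_doa = 40.0 + np.cumsum(rng.normal(0, 1.0, 60))   # wandering source
        obs = true_doa + rng.normal(0, sigma_obs, 60)         # noisy DOA observations

        particles = rng.uniform(0.0, 180.0, n_p)
        est = []
        for z in obs:
            particles += rng.normal(0, sigma_dyn, n_p)        # predict: random walk
            w = np.exp(-0.5 * ((z - particles) / sigma_obs) ** 2)
            w /= w.sum()                                      # weight by likelihood
            particles = particles[rng.choice(n_p, n_p, p=w)]  # resample
            est.append(particles.mean())
        err = np.sqrt(np.mean((np.array(est) - true_doa) ** 2))
        print(f"RMS tracking error: {err:.1f} deg")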

  • Systematic comparison of BIC-based speaker segmentation systems

    Page(s): 66 - 69

    Unsupervised speaker change detection is addressed in this paper. Three speaker segmentation systems are examined. The first system investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, implements a dynamic fusion scheme, and applies the Bayesian information criterion (BIC). The second system consists of three modules: in the first module, a second-order statistical measure is extracted; the Euclidean distance and the Hotelling T2 statistic are applied sequentially in the second module; and BIC is utilized in the third module. The third system first uses a metric-based approach to detect potential speaker change points and then applies the BIC to validate the previously detected change points. Experiments are carried out on a dataset created by concatenating speakers from the TIMIT database. A systematic performance comparison among the three systems is carried out by means of one-way ANOVA and Tukey's post hoc method.
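
    The ΔBIC test at the core of all three systems admits a compact illustration. The sketch below (a generic rendering of the standard criterion, not any of the three compared systems) asks whether a window of feature vectors is better modeled by two Gaussians split at a candidate point than by one; the pseudo-MFCC data and the penalty weight λ = 1.0 are assumptions.

        # Standard ΔBIC test for a speaker change point; positive favors a change.
        import numpy as np

        def delta_bic(X, Y, lam=1.0):
            Z = np.vstack([X, Y])
            n, n1, n2, d = len(Z), len(X), len(Y), Z.shape[1]
            logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False))[1]
            penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
            return (0.5 * n * logdet(Z)
                    - 0.5 * n1 * logdet(X)
                    - 0.5 * n2 * logdet(Y)
                    - penalty)

        rng = np.random.default_rng(4)
        spk_a = rng.normal(0.0, 1.0, size=(300, 12))     # pseudo-MFCCs, speaker A
        spk_b = rng.normal(1.5, 1.2, size=(300, 12))     # pseudo-MFCCs, speaker B
        print("change   :", delta_bic(spk_a, spk_b) > 0)              # expect True
        print("no change:", delta_bic(spk_a[:150], spk_a[150:]) > 0)  # expect False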

  • Analysis of multimodal binary detection systems based on dependent/independent modalities

    Page(s): 70 - 73

    Performance limits of multimodal detection systems are analyzed in this paper. Two main setups are considered, based on the fusion of dependent and independent modalities, respectively. The analysis is performed in terms of the attainable probability of detection error, characterized by the corresponding error exponents. It is demonstrated that the expected performance gain from fusing dependent modalities is superior to that obtained when fusing independent signals. In order to quantify the efficiency of dependent-modality fusion versus the independent case, the analysis is performed for a Gaussian formulation.
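
    A worked toy instance of the error-exponent comparison described above (the mean-shift binary Gaussian setting is an assumption, not necessarily the paper's exact model): test $H_0: \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)$ against $H_1: \mathbf{x} \sim \mathcal{N}(\mathbf{m}, \Sigma)$ with two unit-variance modalities, cross-modal correlation $\rho$, and $\mathbf{m} = (d, d)^T$. The Chernoff/Bhattacharyya exponent governing $P_e \approx e^{-nB}$ over $n$ observations is

        B = \frac{1}{8} \mathbf{m}^T \Sigma^{-1} \mathbf{m} = \frac{d^2}{4(1 + \rho)},

    which reduces to the independent-fusion value $d^2/4$ at $\rho = 0$ and makes explicit how inter-modality dependence shifts the attainable exponent away from the independent case.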

  • Multimodal Sensor Analysis of Sitar Performance: Where is the Beat?

    Page(s): 74 - 77

    In this paper we describe a system for detecting the tempo of sitar performance using a multimodal signal processing approach. Real-time measurements are obtained from sensors on the instrument and from wearable sensors on the performer's body. Experiments comparing audio-based and sensor-based tempo tracking are described. The real-time tempo tracking method is based on extracting onsets and applying Kalman filtering. We show how late fusion of the audio and sensor tempo estimates can improve tracking. The obtained results are used to inform design parameters for a real-time system for human-robot musical performance.
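
    The onset-plus-Kalman-filtering recipe mentioned in the abstract can be illustrated with a scalar filter that smooths noisy inter-onset intervals into a beat-period estimate. This is a hedged sketch under assumed noise variances and synthetic onset times, not the authors' tracker.

        # Scalar Kalman filter smoothing inter-onset intervals into a tempo.
        import numpy as np

        rng = np.random.default_rng(5)
        true_period = 0.5                                   # 120 BPM
        onsets = np.cumsum(true_period + rng.normal(0, 0.03, size=40))
        iois = np.diff(onsets)                              # noisy observations

        x, P = iois[0], 1.0                                 # state: beat period (s)
        Q, R = 1e-5, 0.03 ** 2                              # process / measurement var
        for z in iois[1:]:
            P += Q                                          # predict (random walk)
            K = P / (P + R)                                 # Kalman gain
            x += K * (z - x)                                # update with new IOI
            P *= (1 - K)
        print(f"estimated tempo: {60.0 / x:.1f} BPM")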
