
Spoken Language Technology Workshop (SLT), 2010 IEEE

Date: 12-15 Dec. 2010


Displaying Results 1 - 25 of 93
  • [Front cover]

    Page(s): c1
    PDF (441 KB)
    Freely Available from IEEE
  • [Title page]

    Page(s): i
    PDF (427 KB)
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): ii
    PDF (424 KB)
    Freely Available from IEEE
  • Organizing Committee

    Page(s): iii - iv
    PDF (445 KB)
    Freely Available from IEEE
  • Table of contents

    Page(s): v - xii
    PDF (449 KB)
    Freely Available from IEEE
  • Learning from images and speech with Non-negative Matrix Factorization enhanced by input space scaling

    Page(s): 1 - 6
    PDF (254 KB) | HTML

    Computational learning from multimodal data is often done with matrix factorization techniques such as NMF (Non-negative Matrix Factorization), pLSA (Probabilistic Latent Semantic Analysis) or LDA (Latent Dirichlet Allocation). To this end, the different modalities of the input are converted into features that are easily placed in a vectorized format. An inherent weakness of such a data representation is that only a subset of these data features actually aids the learning. In this paper, we first describe a simple NMF-based recognition framework operating on speech and image data. We then propose and demonstrate a novel algorithm that scales the inputs of this framework in order to optimize its recognition performance.

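    A minimal sketch of the general idea, not the authors' implementation: multimodal features are stacked into one non-negative matrix, per-dimension scaling weights are applied, and NMF is fit on the scaled data. The feature dimensions and scaling weights below are illustrative placeholders; the paper proposes an algorithm to learn such weights so that recognition performance improves.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy multimodal data: each row stacks speech and image features (non-negative).
speech_feats = rng.random((100, 20))   # e.g. acoustic co-occurrence counts
image_feats = rng.random((100, 30))    # e.g. visual codeword histograms
V = np.hstack([speech_feats, image_feats])

# Per-dimension input scaling: up-weight dimensions believed to aid learning.
# (Placeholder weights; the paper optimizes them for recognition performance.)
scale = np.ones(V.shape[1])
scale[:20] *= 2.0                      # e.g. emphasize the speech modality

# Factorize the scaled data: V_scaled ~= W @ H.
model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V * scale)     # per-sample activations
H = model.components_                  # per-dimension basis vectors
```
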
  • Automatically assessing acoustic manifestations of personality in speech

    Page(s): 7 - 12
    PDF (140 KB) | HTML

    In this paper, we present first results on applying a personality assessment paradigm to speech input, and comparing human and automatic performance on this task. We cue a professional speaker to produce speech using different personality profiles and encode the resulting vocal personality impressions in terms of the Big Five NEO-FFI personality traits. We then have human raters, who do not know the speaker, estimate the five factors. We analyze the recordings using signal-based acoustic and prosodic methods and observe high consistency between the acted personalities, the raters' assessments, and initial automatic classification results. This presents a first step towards being able to handle personality traits in speech, which we envision will be used in future voice-based communication between humans and machines.

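    One way such signal-based analysis could be set up, sketched under clear assumptions: utterance-level prosodic statistics are computed from pitch and energy contours and regressed onto the five NEO-FFI trait scores. The feature set, the Ridge regressor, and all data below are illustrative stand-ins, not the authors' system.

```python
import numpy as np
from sklearn.linear_model import Ridge

def prosodic_stats(f0, energy):
    """Simple utterance-level prosodic descriptors from pitch/energy contours."""
    f0 = f0[f0 > 0]                       # keep voiced frames only
    return np.array([
        f0.mean(), f0.std(),              # pitch level and variability
        np.ptp(f0),                       # pitch range
        energy.mean(), energy.std(),      # loudness level and variability
    ])

# Toy data: one f0/energy contour per recording, plus placeholder Big Five
# ratings (openness, conscientiousness, extraversion, agreeableness, neuroticism).
rng = np.random.default_rng(1)
X = np.stack([prosodic_stats(rng.uniform(80, 300, 500), rng.random(500))
              for _ in range(40)])
y = rng.uniform(1, 5, (40, 5))            # placeholder trait ratings

model = Ridge(alpha=1.0).fit(X, y)        # one linear regressor per trait
predicted_traits = model.predict(X[:1])
```
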
  • Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news

    Page(s): 13 - 18
    PDF (221 KB) | HTML

    Analysis of human reference summaries of broadcast news showed that humans give preference to anchor speaker segments while constructing a summary. Therefore, we exploit the role of the anchor speaker in a news show by tracking his/her speech to construct indicative/informative extractive audio summaries. Speaker tracking is done using the Bayesian information criterion (BIC) technique. The proposed technique does not require Automatic Speech Recognition (ASR) transcripts or human reference summaries for training. The objective evaluation by ROUGE showed that summaries generated by the proposed technique are as good as summaries generated by a baseline text summarization system taking manual transcripts as input and summaries generated by a supervised speech summarization system trained using human summaries. The subjective evaluation of audio summaries by humans showed that they prefer summaries generated by the proposed technique to summaries generated by the supervised speech summarization system.

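    For orientation, BIC-based speaker tracking typically rests on a change criterion of the following form: two adjacent feature segments are modeled either with one full-covariance Gaussian or with two, and the penalized likelihood difference decides whether a speaker change is hypothesized. A minimal sketch of that standard criterion (not the paper's exact configuration):

```python
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """BIC change criterion for two feature segments X, Y (frames x dims).
    Positive values favor modeling the two segments as different speakers."""
    Z = np.vstack([X, Y])
    n1, n2, n = len(X), len(Y), len(X) + len(Y)
    d = X.shape[1]

    def logdet_cov(A):
        _, logdet = np.linalg.slogdet(np.cov(A, rowvar=False))
        return logdet

    r = 0.5 * (n * logdet_cov(Z) - n1 * logdet_cov(X) - n2 * logdet_cov(Y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return r - penalty

# Toy check: two segments drawn from clearly different Gaussians.
rng = np.random.default_rng(0)
seg_a = rng.normal(0.0, 1.0, (300, 12))
seg_b = rng.normal(3.0, 1.0, (300, 12))
print(delta_bic(seg_a, seg_b) > 0)   # True -> hypothesize a speaker change
```
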
  • What is left to be understood in ATIS?

    Page(s): 19 - 24
    PDF (109 KB) | HTML

    One of the main data resources used in many studies over the past two decades for spoken language understanding (SLU) research in spoken dialog systems is the airline travel information system (ATIS) corpus. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). Recent studies reported error rates below 5% for both of these tasks employing discriminative machine learning techniques with the ATIS test set. While these low error rates may suggest that this task is close to being solved, further analysis reveals the continued utility of ATIS as a research corpus. In this paper, our goal is not to experiment with domain-specific techniques or features that can help with the remaining SLU errors, but instead to explore methods to realize this utility via extensive error analysis. We conclude that even with such low error rates, the ATIS test set still includes many unseen example categories and sequences, and hence requires more data. Better yet, new, larger annotated data sets from more complex tasks with realistic utterances can avoid over-tuning in terms of modeling and feature design. We believe that advancements in SLU can be achieved by having more naturally spoken data sets and employing more linguistically motivated features, while preserving robustness to speech recognition noise and variance due to natural language.

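    To make the two tasks concrete, here is a toy ATIS-style example of intent determination (one label per utterance) and slot filling (one BIO tag per token). The slot names follow common ATIS annotation conventions, but this particular utterance and tagging are illustrative, not taken from the corpus.

```python
# Intent determination (ID): one label per utterance.
# Slot filling (SF): one BIO tag per token.
utterance = "show flights from boston to denver on tuesday".split()
intent = "flight"                     # ID output for the whole utterance
slots = ["O", "O", "O", "B-fromloc.city_name", "O",
         "B-toloc.city_name", "O", "B-depart_date.day_name"]

# Recover slot values from the BIO tags.
filled = {}
for token, tag in zip(utterance, slots):
    if tag.startswith("B-"):
        filled[tag[2:]] = token
print(intent, filled)
# flight {'fromloc.city_name': 'boston', 'toloc.city_name': 'denver',
#         'depart_date.day_name': 'tuesday'}
```
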
  • Robust representations for out-of-domain emotions using Emotion Profiles

    Page(s): 25 - 30
    PDF (128 KB) | HTML

    The proper representation of emotion is of vital importance for human-machine interaction. A correct understanding of emotion would allow interactive technology to appropriately respond and adapt to users. In human-machine interaction scenarios it is likely that over the course of an interaction, the human interaction partner will express an emotion not seen during the training of the machine's emotion models. It is therefore crucial to prepare for such eventualities by developing robust representations of emotion that can distinctly represent emotions regardless of whether the data were seen during training of the representation. This novel work demonstrates that an Emotion Profile (EP) representation introduced in [1], a representation composed of the confidences of four binary emotion-specific classifiers, can distinctly represent emotions unseen during training. The classification accuracy increases by only 0.35% over the full dataset when the data excluded from the EP training is included. The results demonstrate that EPs are a robust method for emotion representation. View full abstract»

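    A sketch of the EP idea as the abstract describes it: four binary emotion-vs-rest classifiers are trained, and the vector of their confidences forms the profile for an utterance. The classifier choice (SVM), features, and emotion inventory below are placeholders; only the "confidences of four binary emotion-specific classifiers" structure comes from the abstract.

```python
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "happiness", "neutrality", "sadness"]   # illustrative set

def train_emotion_profile(X, labels):
    """One binary (emotion vs. rest) classifier per emotion."""
    return {e: SVC(probability=True).fit(X, (labels == e).astype(int))
            for e in EMOTIONS}

def emotion_profile(classifiers, x):
    """EP = vector of the four classifiers' confidences for one utterance."""
    return np.array([classifiers[e].predict_proba([x])[0, 1] for e in EMOTIONS])

# Toy acoustic features and labels; an emotion unseen in training would still
# map to a distinct point in this 4-dimensional confidence space.
rng = np.random.default_rng(0)
X = rng.random((80, 10))
labels = rng.choice(EMOTIONS, size=80)
profiles = train_emotion_profile(X, labels)
print(emotion_profile(profiles, X[0]))
```
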
  • Investigating modality selection strategies

    Page(s): 31 - 36
    PDF (311 KB) | HTML

    This paper describes a user study about the influence of efficiency on modality selection (speech vs. virtual keyboard / speech vs. physical keyboard) and perceived mental effort. Efficiency was varied in terms of interaction steps. Based on previous research, it was hypothesized that the number of necessary interaction steps determines the preference for a specific modality. Moreover, the relationship between perceived mental effort, modality selection and efficiency was investigated. Results showed that modality selection is strongly dependent on the number of necessary interaction steps. Task duration and modality selection showed no correlation. A relationship between mental effort and modality selection was likewise not observed.

  • Using spoken utterance compression for meeting summarization: A pilot study

    Page(s): 37 - 42
    PDF (137 KB) | HTML

    Most previous work on meeting summarization focused on extractive approaches; however, directly concatenating the extracted spoken utterances may not form a good summary. In this paper, we investigate if it is feasible to compress the transcribed spoken utterances and if using the compressed utterances benefits meeting summarization. We model the utterance compression task as a sequence labeling problem, and show satisfactory performance using a CRF model that incorporates a variety of features capturing lexical, syntactic, and discourse information. We evaluate the impact of utterance compression on the meeting summarization task using compressed sentences (pre-compression) and original transcripts (post-compression), and find that using the compressed meeting transcripts yields slightly better summarization performance. In general, using sentence compression together with extractive summarization can generate reasonable compressed summaries. This is a step closer to abstractive summarization.

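    A sketch of the sequence-labeling formulation, with the caveats that sklearn_crfsuite is an assumed generic CRF toolkit (not necessarily the authors'), and that only simple lexical features are shown, whereas the paper also uses syntactic and discourse cues. Each word is labeled KEEP or DROP, and the kept words form the compressed utterance.

```python
import sklearn_crfsuite  # assumed generic CRF toolkit, not the authors' setup

def word_features(tokens, i):
    """Simple lexical features for token i; the paper adds syntax/discourse."""
    return {
        "word": tokens[i].lower(),
        "is_filler": tokens[i].lower() in {"um", "uh", "you", "know", "like"},
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training pair: label each word KEEP or DROP to form the compression.
utterances = [["um", "we", "should", "you", "know", "finalize", "the", "agenda"]]
labels = [["DROP", "KEEP", "KEEP", "DROP", "DROP", "KEEP", "KEEP", "KEEP"]]

X = [[word_features(u, i) for i in range(len(u))] for u in utterances]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)

tags = crf.predict(X)[0]
compressed = [w for w, t in zip(utterances[0], tags) if t == "KEEP"]
```
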
  • Unbiased discourse segmentation evaluation

    Page(s): 43 - 48
    PDF (134 KB) | HTML

    In this paper, we show that the performance measures Pk and WindowDiff, commonly used for discourse, topic, and story segmentation evaluation, are biased in favor of segmentations with fewer or adjacent segment boundaries. By analytical and empirical means, we show how this results in a failure to penalize substantially defective segmentations. Our novel unbiased measure k-κ corrects this, providing a single score that accounts for chance agreement. We also propose additional statistics that may be used to characterize important properties of segmentations such as boundary clumping. We go on to replicate a recent spoken-language topic segmentation experiment, drawing conclusions that are substantially different from previous studies concerning the effectiveness of state-of-the-art topic segmentation algorithms.

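    For reference, a straightforward implementation of the WindowDiff measure the abstract discusses (Pevzner and Hearst's formulation over boundary indicator sequences); it makes the window-based boundary counting, whose biases the paper analyzes, concrete. The default window width follows the usual half-mean-segment-length convention; the example sequences are made up.

```python
def window_diff(reference, hypothesis, k=None):
    """WindowDiff over boundary sequences (1 = boundary after position i, else 0).
    Counts windows of width k where the number of boundaries disagrees."""
    n = len(reference)
    if k is None:
        # Conventional choice: roughly half the mean reference segment length.
        k = max(2, round(n / (2 * max(1, sum(reference)))))
    errors = sum(
        sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
        for i in range(n - k)
    )
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(window_diff(ref, hyp))
```
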
  • Detecting authority bids in online discussions

    Page(s): 49 - 54
    PDF (91 KB) | HTML

    This paper looks at the problem of detecting a particular type of social behavior in discussions: attempts to establish credibility as an authority on a particular topic. Using maximum entropy modeling, we explore questions related to feature extraction and turn vs. discussion-level modeling in experiments with online discussion text given only a small amount of labeled training data. We also introduce a method for learning interaction words from unlabeled data. Preliminary experiments show that a word-based approach (as used in topic classification) can be used successfully for turn-level modeling, but is less effective at the discussion level. We also find that sentence complexity features are almost as useful as lexical features, and that interaction words are more robust than the full vocabulary when combined with other features. View full abstract»

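    As a reminder of what "maximum entropy modeling" over word features amounts to in practice, it is equivalent to (multinomial) logistic regression over count features. A toy turn-level sketch follows; the example turns, labels, and feature choice are placeholders rather than the paper's data or feature set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy turn-level data: does the turn contain an authority bid?
turns = [
    "as someone who has worked in this field for years, the answer is clear",
    "i think maybe it could be either way",
    "the literature i have published on this shows the opposite",
    "not sure, what do others think?",
]
labels = [1, 0, 1, 0]   # 1 = authority bid (placeholder annotations)

# Maximum entropy model == logistic regression over word-count features.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(turns, labels)
print(model.predict(["in my published work i showed this years ago"]))
```
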
  • Utilizing relationships between named entities to improve speech recognition in dialog systems

    Page(s): 55 - 60
    PDF (166 KB) | HTML

    In this paper, we address the problem of improving recognition accuracy of spoken named entities in the context of dialog systems for transactional applications. We propose utilizing the knowledge of relationships, which typically exist in many applications, between named entities spoken across different dialog states. For example, in a bank customer database, each customer name is associated with one or a few account numbers and addresses, and vice versa. We utilize these relationships to build long-term dependency constraints in grammars (and thus in decoding graphs) representing these entities. This forces the recognizer to use collective evidence from instances of all the entities to improve the recognition accuracy of each individual entity. Experiments conducted to evaluate our approach show significant accuracy improvements on a task of recognizing a person via a name and a location.

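    A simplified illustration of the joint-constraint idea (the paper builds the constraints into grammars and decoding graphs; here the same effect is shown as n-best rescoring): instead of accepting the 1-best name and 1-best location independently, (name, location) pairs are scored jointly and only combinations consistent with the application database are kept. The n-best lists, scores, and database are made up for the example.

```python
# Application database: valid (name, location) pairs, e.g. a customer directory.
database = {("john smith", "boston"), ("jon smyth", "austin"),
            ("joan smith", "denver")}

# Independent n-best recognition results (hypothesis, acoustic score).
name_nbest = [("jon smyth", -1.0), ("john smith", -1.2), ("joan smith", -2.5)]
loc_nbest = [("boston", -0.8), ("austin", -1.5), ("denver", -2.0)]

# Joint rescoring: collective evidence from both entities, constrained by the DB.
best = max(
    ((n, l, sn + sl) for n, sn in name_nbest for l, sl in loc_nbest
     if (n, l) in database),
    key=lambda x: x[2],
)
print(best)   # ('john smith', 'boston', -2.0); the 1-best name alone was wrong
```
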
  • Improving HMM-based extractive summarization for multi-domain contact center dialogues

    Page(s): 61 - 66
    PDF (153 KB) | HTML

    This paper reports the improvements we made to our previously proposed hidden Markov model (HMM) based summarization method for multi-domain contact center dialogues. Since the method relied on Viterbi decoding for selecting utterances to include in a summary, it was unable to control compression rates. We enhance our method by using the forward-backward algorithm together with integer linear programming (ILP) to enable the control of compression rates, realizing summaries that contain as many domain-related utterances and as many important words as possible within a predefined character length. Using call transcripts as input, we verify the effectiveness of our enhancement. View full abstract»

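    A sketch of the ILP selection step under a character-length budget, using the PuLP solver. The per-utterance scores stand in for the forward-backward posteriors and word-importance weights described in the abstract, and the utterances and budget are invented for illustration.

```python
from pulp import LpMaximize, LpProblem, LpVariable, lpSum, value

# Candidate utterances with (illustrative) importance scores from the HMM.
utterances = ["customer reports the router keeps rebooting",
              "agent asks for the model number",
              "weather small talk",
              "agent schedules a technician visit for friday"]
scores = [0.9, 0.6, 0.1, 0.8]
budget = 100   # maximum summary length in characters

prob = LpProblem("summary_selection", LpMaximize)
x = [LpVariable(f"u{i}", cat="Binary") for i in range(len(utterances))]

# Maximize total importance subject to the compression-rate (length) constraint.
prob += lpSum(scores[i] * x[i] for i in range(len(utterances)))
prob += lpSum(len(utterances[i]) * x[i] for i in range(len(utterances))) <= budget
prob.solve()

summary = [u for u, xi in zip(utterances, x) if value(xi) == 1]
```
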
  • Semantic understanding by combining extended CFG parser with HMM model

    Page(s): 67 - 72
    PDF (125 KB) | HTML

    This paper presents a method for extracting both syntactic and semantic tags. An extended CFG parser works in conjunction with an HMM model, which handles unknown words and partially known words, to yield a complete syntactic and semantic interpretation of the utterance. Four experiments and applications were performed using the paradigm to show the usefulness of the approach in processing spoken sentences. View full abstract»

  • Haptic Voice Recognition: Augmenting speech modality with touch events for efficient speech recognition

    Page(s): 73 - 78
    PDF (392 KB) | HTML

    This paper proposes Haptic Voice Recognition (HVR), a multi-modal interface that combines speech and touch sensory inputs to perform voice recognition. These touch inputs form a series of haptic events that provide cues or `landmarks' for word boundaries. These word boundary cues greatly reduce the search space for speech recognition, thereby making the decoding process more efficient and suitable for portable devices with limited compute and memory resources. Furthermore, having knowledge of word boundaries also suppresses insertion and deletion errors. This is particularly helpful when recognition is performed in noisy environments. In this paper, a series of experiments were conducted to study the feasibility of augmenting touch events to automatic speech recognition and explore its potential benefits. Experiments were conducted with synthetically simulated haptic events on the Wall Street Journal database as well as realistic haptic events acquired using a prototype HVR interface implemented on a touchscreen device.

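    An illustrative reduction of how word-boundary cues shrink the search space, well short of what an actual HVR decoder does: if the haptic events supply the number of words (and, say, each word's initial letter), the recognizer's hypothesis space can be filtered accordingly. The n-best list and the specific constraint are assumptions made for the example.

```python
# N-best hypotheses from a (hypothetical) unconstrained recognizer.
nbest = ["recognize speech", "wreck a nice beach", "recognise peach",
         "wreck an ice beach"]

# Haptic events: one touch per word, here also carrying the word's first letter.
haptic_initials = ["r", "s"]   # two taps -> two words, starting with r and s

def consistent(hyp, initials):
    words = hyp.split()
    return (len(words) == len(initials) and
            all(w[0] == c for w, c in zip(words, initials)))

filtered = [h for h in nbest if consistent(h, haptic_initials)]
print(filtered)   # ['recognize speech']
```
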
  • Probabilistic model-based sentiment analysis of Twitter messages

    Page(s): 79 - 84
    PDF (104 KB) | HTML

    We present a machine learning approach to sentiment classification on Twitter messages (tweets). We classify each tweet into two categories: polar and non-polar. Tweets with positive or negative sentiment are considered polar. They are considered non-polar otherwise. Sentiment analysis of tweets can potentially benefit different parties, such as consumers and marketing researchers, for obtaining opinions on different products and services. We present methods for text normalization of the noisy tweets and their classification with respect to the polarity. We experiment with a mixture model approach for generation of sentimental words, which are later used as indicator features of the classification model. Based on a gold standard, manually annotated ensemble of tweets, the new approach obtains F-scores that are 10% better, in relative terms, than a classification baseline that uses raw word n-gram features.

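    A sketch of the normalization-plus-classification pipeline at the level the abstract describes. The normalization rules and the n-gram classifier below correspond to the baseline side of the comparison; the mixture-model sentiment-word features from the paper are not reproduced, and all data and rules are illustrative.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def normalize(tweet):
    """Very simple tweet normalization: strip URLs/mentions, squeeze repeats."""
    tweet = re.sub(r"https?://\S+|@\w+", "", tweet.lower())
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)   # "loooove" -> "loove"
    return tweet.strip()

# Toy polar / non-polar training data.
tweets = ["I loooove this phone!!! @store", "boarding the train now",
          "worst service ever http://t.co/x", "meeting at 3pm"]
labels = ["polar", "non-polar", "polar", "non-polar"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit([normalize(t) for t in tweets], labels)
print(model.predict([normalize("I haaaate waiting!!!")]))
```
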
  • Let's Buy Books: Finding eBooks using voice search

    Page(s): 85 - 90
    PDF (312 KB) | HTML

    We describe Let's Buy Books, a dialog system that helps users search for eBook titles. In this paper we compare different vector space approaches to voice search and find that a hybrid approach, using a weighted sub-space model smoothed with a general model, provides the best performance over different conditions, evaluated using both synthetic queries and queries collected from users through questionnaires.

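    A loose sketch of the smoothing idea, with heavy simplification: scores from a (hypothetical) genre-specific sub-space index are interpolated with scores from a general index before ranking titles. The titles, the TF-IDF indexes, and the interpolation weight are all placeholders; the paper's weighted sub-space model is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["the girl with the dragon tattoo", "a game of thrones",
          "the art of computer programming", "pride and prejudice"]

# General model: fit on all titles.  Sub-space model: fit on one genre only.
general = TfidfVectorizer().fit(titles)
fantasy = TfidfVectorizer().fit(titles[:2])

def search(query, alpha=0.7):
    """Interpolate sub-space and general cosine scores (alpha is a placeholder)."""
    g = cosine_similarity(general.transform([query]), general.transform(titles))[0]
    s = cosine_similarity(fantasy.transform([query]), fantasy.transform(titles))[0]
    scores = alpha * s + (1 - alpha) * g
    return max(zip(titles, scores), key=lambda x: x[1])

print(search("game of thrones book"))
```
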
  • Good grief, I can speak it! Preliminary experiments in audio restaurant reviews

    Page(s): 91 - 96
    PDF (113 KB) | HTML

    In this paper, we introduce a new envisioned application for speech which allows users to enter restaurant reviews orally via their mobile device and, at a later time, update a shared and growing database of consumer-provided information about restaurants. During the intervening period, a speech recognition and NLP-based system has analyzed their audio recording both to extract key descriptive phrases and to compute sentiment ratings based on the evidence provided in the audio clip. We report here on our preliminary work moving towards this goal. Our experiments demonstrate that multi-aspect sentiment ranking works surprisingly well on speech output, even in the presence of recognition errors. We also present initial experiments on integrated sentence boundary detection and key phrase extraction from recognition output.

  • The IBM Attila speech recognition toolkit

    Page(s): 97 - 102
    PDF (156 KB) | HTML

    We describe the design of IBM's Attila speech recognition toolkit. We show how the combination of a highly modular and efficient library of low-level C++ classes with simple interfaces, an interconnection layer implemented in a modern scripting language (Python), and a standardized collection of scripts for system-building produces a flexible and scalable toolkit that is useful both for basic research and for construction of large transcription systems for competitive evaluations.

  • Towards accurate recognition for children's oral reading fluency

    Page(s): 103 - 108
    PDF (616 KB) | HTML

    Systems for assessing and tutoring reading skills place unique requirements on underlying ASR technologies. This paper presents VersaReader, a system that automatically measures children's oral reading fluency skills. Critical techniques that improve the recognition accuracy and make the system practical are discussed in detail. We show that using a set of linguistic rules learned from a collection of transcriptions, the proposed rule-based language model outperformed traditional n-gram language models. Combined with a specific acoustic model with explicit long-silence modeling, plus adaptation, a WER of 7.25% was achieved on our test set. The impact of different kinds of rules on performance is also discussed. We demonstrate that VersaReader can provide highly accurate Words Correct Per Minute scores automatically, which are virtually indistinguishable from scores provided by careful human analysis.

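    The Words Correct Per Minute score itself is a simple computation once per-word correctness judgments are available; a minimal sketch, assuming those judgments come from aligning the recognizer's output with the reading passage (the alignment step, which is where the paper's techniques matter, is not shown).

```python
def words_correct_per_minute(word_judgments, reading_time_seconds):
    """WCPM = number of words read correctly, normalized to one minute.
    word_judgments: per-word booleans from aligning ASR output to the passage."""
    correct = sum(word_judgments)
    return 60.0 * correct / reading_time_seconds

# Toy example: 52 of 60 passage words judged correct in a 45-second reading.
judgments = [True] * 52 + [False] * 8
print(words_correct_per_minute(judgments, 45.0))   # ~69.3 WCPM
```
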
  • Unsupervised cross-lingual speaker adaptation for accented speech recognition

    Page(s): 109 - 114
    PDF (158 KB) | HTML

    In this paper we present investigations of how the acoustic models in automatic speech recognition can be adapted across languages in an unsupervised fashion to improve recognition of speech with a foreign accent. Recognition systems were trained on large Finnish and English corpora, and tested on both monolingual and bilingual material. Adaptation with bilingual and monolingual recognisers was compared. We found that recognition of foreign-accented English was not significantly improved with the help of Finnish adaptation data from the same speaker. However, the recognition of native Finnish using foreign-accented English adaptation data was improved significantly.

  • Muse: An open source speech technology research platform

    Page(s): 115 - 120
    PDF (121 KB) | HTML

    This paper introduces the open source muster speech engine (Muse) for speech technology research. The Muse platform abstracts common data types and software used by speech technology researchers. It is designed to assist researchers in making repeatable experiments that are not hard-coded to a specific platform, language, algorithm, or corpus. It contains a scripting language and a shell where users can interact with various components. The presentation of this paper will be accompanied by a demo at the SLT workshop.
