IEEE Transactions on Speech and Audio Processing

Issue 8 • November 2002

  • Editorial

    Publication Year: 2002 , Page(s): 529 - 530
    PDF (184 KB) | Freely Available from IEEE
  • List of reviewers

    Publication Year: 2002 , Page(s): 659
    PDF (138 KB) | Freely Available from IEEE
  • Author index

    Publication Year: 2002 , Page(s): 660 - 662
    PDF (187 KB) | Freely Available from IEEE
  • Subject index

    Publication Year: 2002 , Page(s): 662 - 669
    PDF (212 KB) | Freely Available from IEEE
  • Low-bitrate distributed speech recognition for packet-based and wireless communication

    Publication Year: 2002 , Page(s): 570 - 579
    Cited by:  Papers (20)  |  Patents (2)
    PDF (568 KB) | HTML

    We present a framework for developing source coding, channel coding and decoding, and erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than to channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft-decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique that modifies the recognition engine itself to take into account the time-varying reliability of the decoded features after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels at bit rates of 1.2 kbps or less.

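    The weighted Viterbi recognition described above scales each frame's contribution to the search by a reliability weight. A minimal sketch of that idea in Python, assuming exponent-style weighting of the per-frame acoustic log-likelihood (the variable names and exact weighting rule are illustrative, not the authors' formulation):

        import numpy as np

        def weighted_viterbi(log_pi, log_A, frame_loglik, gamma):
            """Viterbi decoding with per-frame reliability weights.

            log_pi:       (S,)   log initial state probabilities
            log_A:        (S, S) log transition matrix
            frame_loglik: (T, S) acoustic log-likelihoods log b_s(o_t)
            gamma:        (T,)   weights in [0, 1]; 1 trusts the decoded
                                 feature fully, 0 discounts it entirely
            """
            T, S = frame_loglik.shape
            delta = log_pi + gamma[0] * frame_loglik[0]
            psi = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_A            # prev state -> cur state
                psi[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + gamma[t] * frame_loglik[t]
            path = [int(delta.argmax())]                   # backtrack best state sequence
            for t in range(T - 1, 0, -1):
                path.append(int(psi[t][path[-1]]))
            return path[::-1]

    With gamma fixed at 1 this reduces to standard Viterbi decoding; lowering gamma on frames flagged as unreliable by the channel decoder is what lets accuracy degrade gracefully.
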
  • Perception-based partial encryption of compressed speech

    Publication Year: 2002 , Page(s): 637 - 643
    Cited by:  Papers (25)  |  Patents (4)
    PDF (605 KB) | HTML

    Mobile multimedia applications, the focus of many forthcoming wireless services, increasingly demand low-power techniques for implementing content protection and customer privacy. In this paper, low-complexity perception-based partial encryption schemes for speech are presented. Speech compressed by a widely used speech coding algorithm, the ITU-T G.729 standard at 8 kb/s, is partitioned into two classes: the most perceptually relevant bits, to be encrypted, and the rest, left unprotected. Two partial-encryption techniques are developed: a low-protection scheme aimed at preventing most kinds of eavesdropping, and a high-protection scheme that encrypts a larger share of the perceptually important bits and is meant to perform as well as full encryption of the compressed bitstream. The high-protection scheme, based on the encryption of about 45% of the bitstream, achieves content protection comparable to that obtained by full encryption, as verified by both objective measures and formal listening tests. For the low-protection scheme, encryption of as little as 30% of the bitstream virtually eliminates intelligibility as well as most of the remaining perceptual information. Low-power portable devices could therefore achieve very high levels of speech-content protection at only 30-45% of the computational load of current techniques, freeing resources for other tasks and enabling longer battery life.

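    The partial-encryption idea above reduces to encrypting only a selected subset of bit positions in each compressed frame. A minimal sketch, assuming a 0/1 bit-list representation of an 80-bit G.729 frame; the bit positions and the 30% share are placeholders, since the paper's selection is derived from perceptual sensitivity analysis:

        # Placeholder: treat the first 24 of the 80 bits in a G.729 frame
        # (~30%) as the perceptually important class to be encrypted.
        IMPORTANT_BITS = range(24)

        def partial_encrypt(frame_bits, keystream_bits):
            """XOR only the selected positions with a stream-cipher
            keystream (e.g., AES-CTR or ChaCha20 output); all other
            bits are transmitted in the clear."""
            out = list(frame_bits)
            for pos, ks in zip(IMPORTANT_BITS, keystream_bits):
                out[pos] ^= ks
            return out

    Decryption is the same XOR with the same keystream, so the receiver's extra cost over normal decoding is limited to the encrypted share of the bitstream.
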
  • Chip design of portable speech memopad suitable for persons with visual disabilities

    Publication Year: 2002 , Page(s): 644 - 658
    Cited by:  Papers (6)
    PDF (1226 KB)

    This paper presents the design of a speech recognition and compression chip for portable memopad devices, especially suitable for use by the visually impaired. The proposed design is based on several cores, each of which can be regarded as an intellectual property (IP) core for use in a variety of speech-related application systems. A cepstrum extraction core and a dynamic time warping core are designed to implement the speech recognition algorithms. In the cepstrum extraction core, a novel architecture computes the autocorrelation between overlapping frames using two pairs of shift registers and an intelligent accumulation procedure. The architecture of the dynamic time warping core uses only a single processing element, and is based on our extensive study of the relationship among the nodes in the dynamic time warping lattice. Bit rate is the key factor affecting the memory size for speech compression; therefore, a very low bit-rate speech coder is used. The speech coder exploits a line-spectrum-based interpolation method, which yields fine-quality synthesized speech despite the low 1.6 kbps bit rate. The 1.6 kbps vocoder core is cost-effective and integrates both encoder and decoder algorithms. The proposed design has been tested via hardware simulations on Xilinx Virtex series FPGAs and on a semi-custom chip fabricated in 0.35 μm single-poly four-metal CMOS technology, with a die size of approximately 4.46×4.46 mm².

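    The dynamic time warping core evaluates the standard DTW lattice with a single processing element. For reference, a software version of the recursion the hardware implements (the distance measure and the three-way slope constraint here are generic choices, not necessarily the chip's):

        import numpy as np

        def dtw_distance(ref, test):
            """DTW between two cepstral feature sequences of shape
            (n, d) and (m, d), with the basic three-way recursion."""
            n, m = len(ref), len(test)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # local frame distance
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

    A single-PE implementation sequences these lattice cells through one arithmetic unit, trading throughput for silicon area.
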
  • A system for spoken query information retrieval on mobile devices

    Publication Year: 2002 , Page(s): 531 - 541
    Cited by:  Papers (14)  |  Patents (6)
    PDF (569 KB) | HTML

    With the proliferation of handheld devices, information access on mobile devices is a topic of growing relevance. This paper presents a system that allows the user to search for information on mobile devices using spoken natural-language queries. We explore several issues related to the creation of this system, which combines state-of-the-art speech-recognition and information-retrieval technologies. This is the first work we are aware of that evaluates spoken-query-based information retrieval on a commonly available and well-researched text database: the Chinese news corpus used in the National Institute of Standards and Technology (NIST) TREC-5 and TREC-6 benchmarks. To compare spoken-query retrieval performance across scenarios and recognition accuracies, the benchmark queries, read verbatim by 20 speakers, were recorded simultaneously through three channels: headset microphone, PDA microphone, and cellular phone. Our results show that for mobile devices with high-quality microphones, spoken-query retrieval based on existing technologies yields retrieval precision that comes close to that of perfect text input (mean average precision 0.459 and 0.489, respectively, on TREC-6).

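    Mean average precision, the retrieval metric quoted above (0.459 spoken vs. 0.489 text on TREC-6), is computed per query and then averaged. A minimal sketch:

        def average_precision(ranked_ids, relevant_ids):
            """Mean of the precision values at each rank where a
            relevant document appears; relevant_ids is a set."""
            hits, precisions = 0, []
            for rank, doc in enumerate(ranked_ids, start=1):
                if doc in relevant_ids:
                    hits += 1
                    precisions.append(hits / rank)
            return sum(precisions) / max(len(relevant_ids), 1)

        def mean_average_precision(runs):
            """runs: one (ranked_ids, relevant_ids) pair per query."""
            return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
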
  • The adaptive multirate wideband speech codec (AMR-WB)

    Publication Year: 2002 , Page(s): 620 - 636
    Cited by:  Papers (73)  |  Patents (30)
    PDF (2260 KB) | HTML

    This paper describes the adaptive multirate wideband (AMR-WB) speech codec selected by the Third Generation Partnership Project (3GPP) for GSM and the third-generation WCDMA mobile communication system to provide wideband speech services. The AMR-WB speech codec algorithm was selected in December 2000 and the corresponding specifications were approved in March 2001. The AMR-WB codec was also selected by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) in July 2001 in the standardization activity for wideband speech coding around 16 kb/s, and was approved in January 2002 as Recommendation G.722.2. The adoption of AMR-WB by the ITU-T is of significant importance since, for the first time, the same codec is adopted for wireless as well as wireline services. AMR-WB uses an extended audio bandwidth from 50 Hz to 7 kHz and gives superior speech quality and voice naturalness compared to existing second- and third-generation mobile communication systems. The wideband speech service provided by the AMR-WB codec will give mobile communication speech quality that also substantially exceeds (narrowband) wireline quality. The paper details the AMR-WB standardization history, gives an algorithmic description including novel techniques for efficient ACELP wideband speech coding, and reports the subjective quality performance of the codec.

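    The "adaptive multirate" aspect means the codec can switch among nine standardized bit rates as channel conditions change. A simplified illustration of rate selection (the mode set is the standardized one; the selection policy below is a toy stand-in for network-controlled link adaptation):

        # The nine AMR-WB codec modes in kbit/s (ITU-T G.722.2).
        AMR_WB_MODES = [6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85]

        def select_mode(speech_rate_budget_kbps):
            """Pick the highest codec mode that fits the rate left for
            speech after channel coding; fall back to the lowest mode."""
            usable = [m for m in AMR_WB_MODES if m <= speech_rate_budget_kbps]
            return usable[-1] if usable else AMR_WB_MODES[0]

    In poor radio conditions the speech-rate budget shrinks, the codec drops to a lower mode, and the freed bits go to stronger channel protection.
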
  • Performance improvement of a bitstream-based front-end for wireless speech recognition in adverse environments

    Publication Year: 2002 , Page(s): 591 - 604
    Cited by:  Papers (1)
    PDF (653 KB) | HTML

    We propose a feature enhancement algorithm for wireless speech recognition in adverse acoustic environments. A speech recognition system is realized at the network side of a wireless communications system, and feature parameters are extracted directly from the bitstream of the speech coder employed in the system, where the feature parameters are composed of spectral envelope information and coder-specific information. The coder-specific information is apt to be affected by environmental noise because the speech coder fails to generate high-quality speech in noisy environments. We first found that enhancing noisy speech prior to speech coding improves the recognizer's performance. However, our aim was to develop a robust front-end operating at the network side of a wireless communications system regardless of whether speech enhancement was applied at the sender side. We investigated the effect of a speech enhancement algorithm on the bitstream-based feature parameters. Consequently, a feature enhancement algorithm is proposed that incorporates feature parameters obtained from the decoded speech and from a noise-suppressed version of the decoded speech. The coder-specific information can also be improved by re-estimating the codebook gains and residual energy from the enhanced residual signal. HMM-based connected-digit recognition experiments show that the proposed feature enhancement algorithm significantly improves recognition performance at low signal-to-noise ratio (SNR) without degrading performance at high SNR. From large-vocabulary speech recognition experiments with far-field microphone speech signals recorded in an office environment, we show that the feature enhancement algorithm greatly improves word recognition accuracy.

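    A heavily simplified sketch of the feature-combination step named above: merging cepstra computed from the decoded speech with cepstra from a noise-suppressed version of it. The fixed interpolation weight is an assumption standing in for the paper's actual combination rule:

        import numpy as np

        def enhance_features(cep_decoded, cep_denoised, alpha=0.5):
            """Blend the two feature streams; alpha = 1 trusts the
            noise-suppressed stream entirely."""
            return alpha * np.asarray(cep_denoised) + (1.0 - alpha) * np.asarray(cep_decoded)
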
  • A robust high accuracy speech recognition system for mobile applications

    Publication Year: 2002 , Page(s): 551 - 561
    Cited by:  Papers (12)  |  Patents (1)
    PDF (584 KB) | HTML

    This paper describes a robust, accurate, efficient, low-resource, medium-vocabulary, grammar-based speech recognition system using hidden Markov models for mobile applications. Among the issues and techniques we explore are improving robustness and efficiency of the front-end, using multiple microphones to remove extraneous signals from speech via a new multichannel CDCN technique, reducing computation via silence detection, applying the Bayesian information criterion (BIC) to build smaller and better acoustic models, minimizing finite-state grammars, using hybrid maximum-likelihood and discriminative models, and automatically generating baseforms from single new-word utterances.

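    The Bayesian information criterion mentioned above trades model fit against model size. The criterion itself is standard; a minimal helper for comparing candidate acoustic models:

        import numpy as np

        def bic(log_likelihood, num_params, num_frames, penalty=1.0):
            """BIC with the sign convention 'higher is better':
            log L - (penalty / 2) * k * log N. penalty = 1 is standard;
            larger values push toward smaller models."""
            return log_likelihood - penalty * 0.5 * num_params * np.log(num_frames)

    Candidate models (for example, different numbers of Gaussians per state) are trained, and the one with the highest BIC is kept.
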
  • Distributed speech processing in MiPad's multimodal user interface

    Publication Year: 2002 , Page(s): 605 - 619
    Cited by:  Papers (16)  |  Patents (6)
    PDF (1748 KB) | HTML

    This paper describes the main components of MiPad (multimodal interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken-language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface. Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad's design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad to be noise robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client-server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the features transmitted to the server are subject to further speech feature enhancement, speech decoding and understanding before a dialog is carried out and actions are rendered. Noise robustness can be achieved at the client, at the server, or both. Various speech processing aspects of this type of distributed computation as related to MiPad's potential deployment are presented. Previous user interface study results are also described. Finally, we point out future research directions related to several key MiPad functionalities.

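    The client half of the client-server split described above performs primitive feature extraction, feature quantization, and error protection. A toy end-to-end sketch, assuming 256-sample frames; the features, quantizer, and CRC-based protection are illustrative, not MiPad's actual front end:

        import binascii
        import struct

        import numpy as np

        def extract_features(frame):
            """Log frame energy plus four coarse log band energies."""
            spectrum = np.abs(np.fft.rfft(frame))[:-1]          # 128 bins for 256 samples
            log_e = np.log(np.sum(frame ** 2) + 1e-10)
            bands = np.log(spectrum.reshape(4, 32).sum(axis=1) + 1e-10)
            return np.concatenate(([log_e], bands))

        def quantize(features, step=0.5):
            """Uniform 8-bit scalar quantization of the feature vector."""
            return np.clip(np.round(features / step) + 128, 0, 255).astype(np.uint8)

        def protect(indices):
            """Append a CRC-32 so the server can detect corrupted packets
            before feature enhancement and decoding."""
            payload = indices.tobytes()
            return payload + struct.pack(">I", binascii.crc32(payload))
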
  • Graceful degradation of speech recognition performance over packet-erasure networks

    Publication Year: 2002 , Page(s): 580 - 590
    Cited by:  Papers (18)
    PDF (379 KB) | HTML

    This paper explores packet loss recovery for automatic speech recognition (ASR) in spoken dialog systems, assuming an architecture in which a lightweight client communicates with a remote ASR server. Speech is transmitted with source and channel codes optimized for the ASR application, i.e., to minimize word error rate. Unequal amounts of forward error correction, depending on the data's effect on ASR performance, are assigned to protect against packet loss. Experiments with simulated packet loss in a range of loss conditions are conducted on the DARPA Communicator (air travel information) task. Results show that the approach provides robust ASR performance that degrades gracefully as packet loss rates increase. Transmitting at 5.2 kbps with up to 200 ms of added delay leads to only a 7% relative degradation in word error rate even under extremely adverse network conditions.

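    Unequal error protection assigns redundancy according to each feature's effect on word error rate. A toy version using repetition coding as a stand-in for the paper's channel codes; the tier assignments are illustrative:

        # Hypothetical protection tiers: repetition counts per feature group.
        FEC_TIERS = {"high": 3, "medium": 2, "low": 1}

        def protect_packet(features_by_tier):
            """Repeat each feature group FEC_TIERS[tier] times; groups
            whose loss hurts ASR most get the most redundancy, the rest
            rely on concealment at the server."""
            packet = []
            for tier, feats in features_by_tier.items():
                packet.extend(list(feats) * FEC_TIERS[tier])
            return packet
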
  • A multistage algorithm for spotting new words in speech

    Publication Year: 2002 , Page(s): 542 - 550
    Cited by:  Papers (18)  |  Patents (6)
    PDF (422 KB) | HTML

    In this paper, we present a fast, vocabulary-independent algorithm for spotting words in speech. The algorithm consists of a phone-ngram representation (indexing) stage and a coarse-to-detailed search stage for spotting a word/phone sequence in speech. The phone-ngram representation stage provides a phoneme-level representation of the speech that can be searched efficiently. We present a novel method for phoneme recognition that uses a vocabulary prefix tree to guide the creation of the phone-ngram index. The coarse search, consisting of phone-ngram matching, identifies regions of speech as putative word hits. The detailed acoustic match is then conducted only at the putative hits identified in the coarse match. This gives us vocabulary independence together with the desired accuracy and speed in wordspotting. Current lattice-based phoneme-matching algorithms are similar to the coarse-match step of our algorithm. We show that our combined algorithm gives a factor-of-two improvement over the coarse match. The algorithm has wide-ranging use in distributed and pervasive speech recognition applications such as audio indexing, spoken message retrieval and video browsing.

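    The coarse stage above is essentially an inverted index over phone n-grams. A minimal sketch of indexing and coarse matching (the voting heuristic for locating putative hits is an illustration, not the paper's exact scoring):

        from collections import defaultdict

        def build_index(phone_seq, n=3):
            """Map every phone n-gram to the positions where it occurs."""
            index = defaultdict(list)
            for i in range(len(phone_seq) - n + 1):
                index[tuple(phone_seq[i:i + n])].append(i)
            return index

        def coarse_search(index, query_phones, n=3):
            """Vote for putative word-start positions wherever a query
            n-gram matches; top-voted regions go to the detailed
            acoustic match."""
            votes = defaultdict(int)
            for i in range(len(query_phones) - n + 1):
                for pos in index.get(tuple(query_phones[i:i + n]), []):
                    votes[pos - i] += 1
            return sorted(votes, key=votes.get, reverse=True)
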
  • ASR in mobile phones - an industrial approach

    Publication Year: 2002 , Page(s): 562 - 569
    Cited by:  Papers (8)  |  Patents (1)
    PDF (260 KB) | HTML

    In order to make hidden Markov model (HMM) speech recognition suitable for mobile phone applications, Siemens developed a recognizer, the Very Smart Recognizer (VSR), for deployment in future mobile phone generations. Typical applications will be name dialling and command-and-control operations suited to different environments, for example in cars. The paper describes research and development issues for a speech recognizer in mobile devices, focusing on noise robustness, memory efficiency and integer implementation. The VSR is shown to reach a word error rate as low as 4.1% on continuous digits recorded in a car environment. Furthermore, by means of discriminative training and HMM-parameter coding, the memory requirements of the VSR HMMs are kept below 64 kbytes.

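    HMM-parameter coding shrinks model storage by representing parameters with a few bits each. A hypothetical illustration using 8-bit fixed-point coding of Gaussian means with one scale factor per vector (the VSR's actual coding scheme is not reproduced here):

        import numpy as np

        def quantize_means(means, n_bits=8):
            """Map each row of mean vectors to n-bit integers plus one
            float scale per row."""
            scale = np.abs(means).max(axis=1, keepdims=True) / (2 ** (n_bits - 1) - 1)
            scale = np.maximum(scale, 1e-12)          # guard all-zero rows
            return np.round(means / scale).astype(np.int8), scale

        def dequantize_means(q, scale):
            return q.astype(np.float32) * scale

    Going from 32-bit floats to 8-bit integers cuts mean storage by roughly a factor of four, which is how budgets like 64 kbytes become feasible.
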

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
