2010 International Conference on Audio Language and Image Processing (ICALIP)

Date 23-25 Nov. 2010

Displaying Results 1 - 25 of 352
  • A query-by-humming music information retrieval from audio signals based on multiple F0 candidates

    Publication Year: 2010 , Page(s): 1 - 5

    In this paper, we propose a query-by-humming (QbH) system that retrieves musical pieces stored as audio signals. Most conventional QbH systems assume that symbolic melody information is given a priori, which is not always true. In our system, the retrieval database is generated from single-channel (1ch) audio signals containing many sounds. We generate the database by estimating the fundamental frequencies (F0) of the audio signals frame by frame. To improve retrieval accuracy, we exploit multiple F0 candidates to absorb the impact of F0 estimation errors. In experiments, using multiple F0 candidates improved retrieval accuracy by about 15 points compared with a QbH system using only one F0 candidate.
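
The abstract does not describe the matching stage. As a minimal sketch, assuming pitch is compared in semitones with dynamic time warping (DTW) and that the hummed query has already been transposed to the track's key (real QbH systems must also handle key and tempo differences), the multiple-candidate idea amounts to letting each database frame's local cost use its closest F0 candidate. The function names and the DTW formulation are illustrative assumptions, not the authors' implementation.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=440.0):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def dtw_multi_candidate(query, candidates):
    """DTW distance between a hummed pitch sequence and one database track.

    query:      list of pitches (semitones), one per query frame
    candidates: candidates[j] is the list of F0 candidates (semitones)
                estimated for database frame j
    The local cost uses the closest candidate, so a wrong top-1 F0 estimate
    (e.g. an octave error) is absorbed if another candidate is correct.
    """
    n, m = len(query), len(candidates)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = min(abs(query[i - 1] - c) for c in candidates[j - 1])
            # Standard DTW recursion: insertion, deletion, or match step.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

For example, with query `[0, 2, 4]` and database frames `[[0], [14, 2], [4]]` (an octave error on the middle frame's first candidate), the multi-candidate distance is 0, while keeping only the first candidate of each frame gives a distance of 12.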

  • Learning hidden variables in Bayesian Networks with Bayesian Entropy Criterion for supervised classification

    Publication Year: 2010 , Page(s): 6 - 11

    In this paper, we use a new criterion, the Bayesian Entropy Criterion (BEC), to learn hidden-variable Bayesian Networks for supervised classification. This criterion takes into account the decisional purpose of a model by minimizing the integrated classification entropy. Experiments on a real dataset show that BEC performs better than the Bayesian Information Criterion (BIC) at selecting a model that minimizes the classification error rate. By learning hidden-variable structures with BEC, we can find more effective hidden variables for a supervised classification model, which may reveal valuable principles of a given domain.

  • A robust algorithm of double talk detection based on voice activity detection

    Publication Year: 2010 , Page(s): 12 - 15

    Double-talk detection is used in acoustic echo cancellation systems to keep the adaptive filter from diverging. This paper describes a new real-time double-talk detection algorithm. A voice activity detection algorithm is used to detect the endpoints of each speech segment, and a logic unit then detects double talk in the dialogue. The new algorithm is robust against noise and tracks channel changes quickly. Simulations show that the new algorithm performs well compared with double-talk detectors (DTDs) based only on the power ratio or on cross-correlation.
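
A rough sketch of the VAD-plus-logic-unit structure follows. It is not the paper's algorithm: it assumes a toy energy VAD with a hypothetical fixed threshold, and it assumes the near-end VAD is not triggered by echo (for example, because it runs on the echo canceller's residual).

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB (floored to avoid log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def vad(frame, threshold_db=-40.0):
    """Toy energy VAD: a frame is active if its energy exceeds a fixed threshold."""
    return frame_energy_db(frame) > threshold_db


def dtd_state(far_frame, near_frame, threshold_db=-40.0):
    """Logic unit combining the far-end and near-end VAD decisions."""
    far = vad(far_frame, threshold_db)
    near = vad(near_frame, threshold_db)
    if far and near:
        return "double_talk"    # freeze adaptive-filter updates
    if far:
        return "far_end_only"   # safe to adapt the echo canceller
    if near:
        return "near_end_only"
    return "silence"
```

The point of the logic unit is the first branch: adaptation is halted only when both sides are active, which is what keeps the adaptive filter from diverging.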

  • Low latency audio pitch shifting in the frequency domain

    Publication Year: 2010 , Page(s): 16 - 24

    This paper presents a low-latency pitch-shifting algorithm based on the Short-Time Fourier Transform (STFT). Unlike existing STFT-based implementations of pitch shifting, the presented algorithm is more robust to reductions of the Fourier transform size. As a result, it achieves latencies as low as 12 ms while still producing good quality, whereas other algorithms perform much worse under similar low-latency constraints. The presented algorithm also provides an alternative way of mitigating the well-known phasiness problem of the phase vocoder.

  • Multi-layered features with SVM for Chinese accent identification

    Publication Year: 2010 , Page(s): 25 - 30
    Cited by:  Papers (4)

    In this paper, we propose an approach that combines multi-layered features with a support vector machine (SVM) for Chinese accent identification. The multi-layered features include both segmental and suprasegmental information, such as MFCCs and the pitch contour, to capture the diversity of variations in Chinese accented speech. The pitch contour is modeled with a cubic polynomial to capture the varying characteristics of different Chinese accents. We train two GMM acoustic models to represent the features of a given accent. Because the original GMM decision criterion cannot handle such multi-layered features, an SVM is used to make the decision. The effectiveness of the proposed approach was evaluated on the 863 Chinese accent corpus. Our approach yields a significant 10% relative error rate reduction in Chinese accented speech identification compared with traditional approaches that use a single feature at a single level.
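
The cubic polynomial contour model can be illustrated with an ordinary least-squares fit. This sketch assumes time is normalized to [0, 1] over each segment and solves the 4x4 normal equations directly; the four coefficients would form one suprasegmental feature vector. How the paper combines them with MFCCs and GMM scores before the SVM is not shown.

```python
def fit_cubic_contour(f0):
    """Least-squares fit f0(t) ~ a0 + a1*t + a2*t**2 + a3*t**3.

    f0: pitch values for one segment (at least 4 frames).
    Time is normalized to t in [0, 1]; returns [a0, a1, a2, a3].
    """
    n_pts = len(f0)
    if n_pts < 4:
        raise ValueError("need at least 4 pitch values for a cubic fit")
    t = [i / (n_pts - 1) for i in range(n_pts)]
    n = 4
    # Normal equations (X^T X) a = X^T y for the monomial basis 1, t, t^2, t^3.
    A = [[sum(ti ** (r + c) for ti in t) for c in range(n)] for r in range(n)]
    b = [sum((ti ** r) * yi for ti, yi in zip(t, f0)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = s / A[r][r]
    return coeffs
```

Normalizing time keeps the normal equations well conditioned, and it makes the coefficients comparable across segments of different lengths.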

  • Performance analysis and evaluation of AVS-M audio coding

    Publication Year: 2010 , Page(s): 31 - 36
    Cited by:  Papers (2)

    The AVS-M audio standard, which targets wireless networks and mobile devices, is being independently developed in China. Its framework is similar to that of AMR-WB+. This paper analyzes the performance of the AVS-M audio core algorithms. To analyze its complexity, a fixed-point version of the AVS-M codec is implemented on a DSP platform. Finally, a performance comparison between AVS-M and AMR-WB+ is discussed. The results show that AVS-M audio performs no worse than AMR-WB+ on average.

  • Multi-pitch determination algorithm based on mixture laplacian distribution

    Publication Year: 2010 , Page(s): 37 - 41

    A multi-pitch determination algorithm based on the mixture Laplacian distribution (MLD) is proposed. The MLD replaces the autocorrelation function (ACF) in the correlogram, which indicates the likelihood that a given lag is the pitch period. Peaks in the summary MLD indicate the multiple pitch periods. Compared with the summary correlogram, the summary MLD has better resolution and fewer spurious peaks that do not correspond to a pitch period. The proposed algorithm is evaluated on a database of speech utterances mixed with various types of interference. The comparisons show that our algorithm performs better.

  • An efficient loudness estimation algorithm for speech signal

    Publication Year: 2010 , Page(s): 42 - 45

    The significance of sound quality assessment is summarized. Based on an analysis of characteristics of the human auditory system and a model of loudness perception, an efficient approximate loudness estimation algorithm is proposed using regression analysis. Experimental results show that the approximate algorithm performs well for speech signals. The results also show that speech loudness differs from speaker to speaker. Based on this, other psychoacoustic parameters for objective speech quality assessment may be obtained.

  • Integration of the podcasting system and multimedia tools in second language teaching: Practice, evaluation, and future implications in the theory of second language acquisition

    Publication Year: 2010 , Page(s): 46 - 51
    Cited by:  Papers (1)

    The project presented in this paper introduces a new type of English presentation course in Japan. The main concept of the course design is the application of a podcasting system and ICT skills, with the main emphasis on students' involvement in creating podcast movies. Starting with an overview of the situation in Japan in terms of foreign language acquisition, this paper argues that students greatly need more motivation to use a foreign language. Against this background, this paper suggests that supplementary tools such as the podcasting system introduced here help greatly to enhance students' motivation. Based on a questionnaire, this paper concludes that the system achieves higher-than-normal satisfaction compared with higher-level PC-based learning management systems that have many more functions and “costs.” The paper further illustrates how the class proceeded. This project depends crucially on the “Creation” of a product using English and ICT. Students clearly must do much more than in a traditional presentation; this, however, does not weaken their motivation to use English for a specific purpose. Finally, some future implications are given. Based on our feedback research, we emphasize that ICT, if properly adopted, will enhance students' motivation, which is a crucial assumption of our project. Further improvements in ICT will lead to further variation in teaching methods in the field of foreign language acquisition. Our paper introduces one such case study, conducted at our institution, which aims to create a local promotional video in a target foreign language.

  • Segmental HMM-based Part-of-speech tagger

    Publication Year: 2010 , Page(s): 52 - 56

    This paper presents a solution to the problem of using an HMM-based POS tagger for languages in which a word can consist of several tokens. The Viterbi algorithm is modified to support word segments within a model state. In other words, the proposed system has a built-in tokenizer that identifies word boundaries as well as the corresponding tag sequence.
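
A minimal sketch of a Viterbi search extended with word segments, under stated assumptions: a bigram tag model, a lexicon keyed by tuples of tokens, and a hypothetical maximum word length `max_len`. The search state is (token position, last tag), and each step may consume 1 to `max_len` tokens as one word, so segmentation and tagging are decided jointly. This illustrates the idea described in the abstract, not the paper's exact model.

```python
import math


def segmental_viterbi(tokens, lexicon, trans, max_len=3, start="<s>"):
    """Jointly segment tokens into words and tag each word.

    lexicon: {word: {tag: log P(word | tag)}}, a word being a tuple of tokens
    trans:   {(prev_tag, tag): log P(tag | prev_tag)}
    Returns (words, tags) for the best path, or None if no path exists.
    """
    n = len(tokens)
    neg_inf = -math.inf
    # best[i][tag] = (score, start index of last word, previous tag)
    best = [dict() for _ in range(n + 1)]
    best[0][start] = (0.0, None, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = tuple(tokens[j:i])
            if word not in lexicon or not best[j]:
                continue
            for tag, emit in lexicon[word].items():
                for prev, (score, _, _) in best[j].items():
                    cand = score + trans.get((prev, tag), neg_inf) + emit
                    if cand > best[i].get(tag, (neg_inf,))[0]:
                        best[i][tag] = (cand, j, prev)
    if not best[n]:
        return None
    # Backtrack from the highest-scoring final tag.
    tag = max(best[n], key=lambda k: best[n][k][0])
    i, words, tags = n, [], []
    while i > 0:
        _, j, prev = best[i][tag]
        words.append(tuple(tokens[j:i]))
        tags.append(tag)
        i, tag = j, prev
    return words[::-1], tags[::-1]
```

Compared with token-level Viterbi, the only change is the inner loop over segment start `j`, which costs a factor of `max_len` in running time.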

  • Monaural voiced speech segregation based on combined cues and energy distribution

    Publication Year: 2010 , Page(s): 57 - 63

    Monaural speech segregation is important for speech signal processing, and it has been studied extensively on the basis of auditory scene analysis principles. However, current segregation algorithms cannot achieve satisfactory performance in the high-frequency range. In this paper, we propose a system for monaural voiced speech segregation in which two novel ideas are investigated. First, combined cues (cross-channel correlation, temporal continuity, and onset/offset) are used to generate segments in the high-frequency range. Second, the energy distribution of the mixed signal is used to indicate the reliability of cues in the high-frequency range, and an alternative segmentation strategy is applied accordingly. Systematic evaluation and comparison show that the proposed system improves the SNR gain.

  • Speech enhancement using generalized least absolute deviation estimation

    Publication Year: 2010 , Page(s): 64 - 68
    Cited by:  Papers (3)

    Based on a novel generalized least absolute deviation (GLAD) method, this paper proposes an effective speech enhancement algorithm for removing noise from speech signals. The speech signal is modeled as an autoregressive (AR) process whose parameters are estimated by the GLAD method, and the speech signal is then recovered by Kalman filtering. Simulation results show that the proposed GLAD-estimation-based algorithm achieves better speech enhancement performance than Kalman filtering algorithms based on second-order and higher-order estimation.
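
The abstract's contribution is the GLAD estimate of the AR parameters, which is not reproduced here. The sketch below shows only the Kalman recovery step, simplified to an AR(1) speech model with a known coefficient and known noise variances (the paper's model order is not stated in the abstract); `simulate_ar1` is a hypothetical helper for generating test data.

```python
import math
import random


def kalman_ar1(y, a, q, r):
    """Scalar Kalman filter for s[n] = a*s[n-1] + w[n], y[n] = s[n] + v[n].

    a: AR(1) coefficient (|a| < 1), assumed already estimated
    q: variance of the driving noise w; r: variance of the observation noise v
    """
    s_hat, p = 0.0, q / (1.0 - a * a)  # start from the stationary prior
    out = []
    for yn in y:
        # Predict the next clean sample and its error variance.
        s_pred = a * s_hat
        p_pred = a * a * p + q
        # Correct the prediction with the noisy observation.
        k = p_pred / (p_pred + r)
        s_hat = s_pred + k * (yn - s_pred)
        p = (1.0 - k) * p_pred
        out.append(s_hat)
    return out


def simulate_ar1(n, a, q, r, seed=0):
    """Generate a clean AR(1) signal and a noisy observation of it."""
    rng = random.Random(seed)
    s, clean, noisy = 0.0, [], []
    for _ in range(n):
        s = a * s + rng.gauss(0.0, math.sqrt(q))
        clean.append(s)
        noisy.append(s + rng.gauss(0.0, math.sqrt(r)))
    return clean, noisy
```

The quality of the recovered signal depends directly on how well `a` (and, for real speech, the higher-order AR coefficients) is estimated, which is where a robust estimator such as GLAD would enter.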

  • A research of objective evaluation method for audio brightness

    Publication Year: 2010 , Page(s): 69 - 73

    The algorithm analyzes the definition and description of the subjective sensation of "brightness" in combination with psychoacoustic parameters. It extracts modulation frequency, critical band, loudness level, and spectral distance as the characteristic parameters of the "brightness" evaluation model, and it modifies the spectral distance and the sound pressure level function to better suit the features of brightness. These features are weighted to form the brightness formula. Meanwhile, MOS (mean opinion score) tests are used for subjective evaluation, and the subjective results are compared with the objective results to obtain their correlation. The simulation results show that the correlation between objective and subjective scores can reach 0.884.
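
The weighting and validation steps can be sketched as a weighted sum of features followed by a Pearson correlation between objective scores and MOS. The weights here are hypothetical placeholders, since the abstract does not give the paper's values, and feature extraction itself is not shown.

```python
import math


def brightness_score(features, weights):
    """Objective brightness as a weighted sum of feature values (weights hypothetical)."""
    return sum(w * f for w, f in zip(weights, features))


def pearson(x, y):
    """Pearson correlation between objective scores x and subjective (MOS) scores y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A reported correlation such as 0.884 would be `pearson(objective_scores, mos_scores)` computed over the test set of audio clips.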

  • A lightweight processing for conversion of whispering voice into normal speech

    Publication Year: 2010 , Page(s): 74 - 79
    Cited by:  Papers (2)

    Whispered speech is considered to be as intelligible as normal speech when heard at close range, and it can also carry prosodic information. The study introduced here focuses on developing a signal processing application, not a recognition application, in which the whispered signal (speech produced without vocal fold activation) is converted into a pseudo-voiced signal similar to normal speech. The proposed method aims to achieve this without relying on any specific hardware device, using only the properties and information of the whispered signal itself. Finally, the proposed method is validated and the results are analyzed.

  • A new voice activity detection method using maximized Sub-band SNR

    Publication Year: 2010 , Page(s): 80 - 84
    Cited by:  Papers (1)  |  Patents (1)

    This paper presents a novel voice activity detection (VAD) method that uses the Maximum Values of Sub-band SNR (MVSS) as the detection feature. MVSS has different distributions for speech and non-speech signals, which helps separate speech from heavy noise. An adaptive threshold is applied to improve VAD accuracy and to track the noisy signal rapidly without complex computation. Experimental results on the NOISEX-92 database show that the proposed method outperforms the conventional ETSI AMR VADs.
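
As a minimal sketch, assuming per-frame sub-band energies are already available, MVSS is the largest per-band SNR in dB. The abstract does not give the adaptive-threshold rule, so this sketch substitutes a simple noise-tracking scheme: a fixed margin `threshold_db` over a noise estimate that is updated only on non-speech frames, with a hypothetical smoothing factor `alpha`.

```python
import math


def mvss(band_energies, noise_energies):
    """Maximum sub-band SNR (dB) of one frame, given per-band noise estimates."""
    return max(
        10.0 * math.log10(max(e, 1e-12) / max(nz, 1e-12))
        for e, nz in zip(band_energies, noise_energies)
    )


def mvss_vad(frames, init_noise, threshold_db=6.0, alpha=0.95):
    """Label each frame as speech (True) or non-speech (False).

    frames:     list of per-frame sub-band energy lists
    init_noise: initial per-band noise energy estimate
    The noise estimate is smoothed with non-speech frames only, so the
    decision follows slow changes in the noise level (a stand-in for the
    paper's adaptive threshold, whose exact rule is not given).
    """
    noise = list(init_noise)
    decisions = []
    for bands in frames:
        is_speech = mvss(bands, noise) > threshold_db
        if not is_speech:
            noise = [alpha * nz + (1 - alpha) * e for nz, e in zip(noise, bands)]
        decisions.append(is_speech)
    return decisions
```

Taking the maximum over bands is what helps in heavy noise: a speech frame only needs one band, such as a formant region, to rise clearly above the noise floor.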

  • Notice of Violation of IEEE Publication Principles
    An automatic data-driven technique for selecting background dataset in GMM-SVM speaker verification system

    Publication Year: 2010 , Page(s): 85 - 89

    Notice of Violation of IEEE Publication Principles

    "An Automatic Data-driven Technique for Selecting Background Dataset in GMM-SVM Speaker Verification System"
    by Jinchao Yang, Haipeng Wang, Jianping Zhang, Yonghong Yan
    International Conference on Audio Language and Image Processing (ICALIP), 2010, pp. 85-89

    After careful and considered review of the content and authorship of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles.

    This paper contains significant portions of original text from the paper cited below. The original text was copied without attribution (including appropriate references to the original author(s) and/or paper title) and without permission.

    "Improved SVM Speaker Verification Through Data-Driven Background Dataset Selection"
    by Mitchell McLaren, Brendan Baker, Robbie Vogt, Sridha Sridharan
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009, pp. 4041-4044

    In this paper, we propose an automatic data-driven technique for selecting a proper background dataset. With this technique, impostor confidence (IC) is proposed as a metric, and a more discriminative background dataset is automatically chosen by IC to train a more discriminative model. Experimental results on the NIST 2008 SRE corpus with a GMM-SVM speaker verification system show that the proposed approach performs better: relative reductions in mincost of 8.9% for female and 4.6% for male speakers are obtained, and with female and male combined, a 5.4% relative reduction in mincost is obtained over a heuristically selected background dataset.

  • Improved H.264 video compression based on Cubic Macro-blocks

    Publication Year: 2010 , Page(s): 90 - 93

    In the conventional H.264 standard, macroblocks (MBs) of a specified size are cut from a single frame or field. For every frame, the encoded data consist mainly of the motion vectors (MVs) and the residuals after motion compensation (MC) for all MBs in the frame. However, MVs may remain relatively constant in videos containing slowly changing scenes. This indicates that it is unnecessary to estimate the MVs for every frame; instead, the same MVs can be used for several adjacent frames. Consequently, compression and computational efficiency can be improved significantly. The concepts of “Virtual Frames” (VF) and Cubic Macro-Blocks (CMB) are introduced to achieve this goal. The residual generated after MC is processed using a 3D-DCT to achieve high compression. Matlab simulations substantiate the efficiency of this algorithm.
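
The transform stage for a cubic macroblock can be sketched as a separable 3-D DCT: stacking co-located residual blocks from several adjacent frames gives a cube, and an orthonormal DCT-II is applied along columns, rows, and time. The 4x4x4 cube size used in the usage example is an assumption; quantization and entropy coding are omitted.

```python
import math


def dct1(v):
    """Orthonormal 1-D DCT-II of a list."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def dct3d(cube):
    """Separable 3-D DCT-II of a residual cube indexed cube[t][y][x]."""
    T, H, W = len(cube), len(cube[0]), len(cube[0][0])
    # 1) Transform each row along x.
    a = [[dct1(cube[t][y]) for y in range(H)] for t in range(T)]
    # 2) Transform each column along y.
    b = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for x in range(W):
            col = dct1([a[t][y][x] for y in range(H)])
            for y in range(H):
                b[t][y][x] = col[y]
    # 3) Transform along time t (across the stacked frames).
    c = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for y in range(H):
        for x in range(W):
            line = dct1([b[t][y][x] for t in range(T)])
            for t in range(T):
                c[t][y][x] = line[t]
    return c
```

For slowly changing scenes the residual varies little along t, so its energy concentrates in the t = 0 plane of coefficients, which is what makes the temporal transform pay off.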

  • SIR-based call admission control by more realistic imperfect power control for CDMA system

    Publication Year: 2010 , Page(s): 94 - 97

    This paper proposes an imperfect power control model that is more realistic than previous methods, for signal-to-interference ratio (SIR)-based call admission control (CAC) algorithms in CDMA systems. A base station accepts a new call if the measured SIR can be guaranteed under the proposed model. We also consider voice, data, and multimedia traffic services. The simulation results show that, when used in the same CAC scheme, the proposed model outperforms the previous model.
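
A simplified sketch of an SIR-based admission test, assuming a single-cell uplink, a processing gain `gain`, and imperfect power control modeled as a per-user dB error on the received power. This is a generic textbook-style model, not the paper's more realistic one, which the abstract does not specify.

```python
def received_powers(nominal, errors_db):
    """Apply per-user power-control errors (in dB) to nominal received powers."""
    return [p * 10 ** (e / 10.0) for p, e in zip(nominal, errors_db)]


def sir(i, powers, noise, gain):
    """Uplink SIR of user i: processing gain times own power over interference plus noise."""
    interference = sum(p for j, p in enumerate(powers) if j != i)
    return gain * powers[i] / (interference + noise)


def admit(active_powers, new_power, noise, gain, sir_target):
    """Admit a new call only if every user, including the new one, keeps SIR >= target."""
    powers = active_powers + [new_power]
    return all(sir(i, powers, noise, gain) >= sir_target for i in range(len(powers)))
```

With equal received powers, the SIR of each of N users is gain / (N - 1 + noise/power), so capacity falls as users are added; power-control errors make some users' SIR drop below that value, which is why the error model affects admission decisions.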

  • Perceptual quality optimized FEC for video streaming

    Publication Year: 2010 , Page(s): 98 - 102

    Video transmission over the Internet often suffers from various packet losses. Packet-level forward error correction (FEC) is often used for error recovery to enhance end-to-end video quality. However, the quality metrics used in traditional FEC scheme design, such as peak signal-to-noise ratio (PSNR), have been found not to correlate well with human perceptual quality and usually lead to a suboptimal choice of FEC coding parameters. This paper proposes a perceptual-quality-optimized FEC scheme for video streaming over lossy channels, in which a perceptual quality metric is developed to optimize FEC coding under limited bandwidth. Both analytical and experimental approaches are used to build the perceptual video quality metric. Based on the proposed metric, the perceptually optimized FEC parameters are selected through rate-quality optimization. Experimental results show that the optimized FEC outperforms the traditional algorithm in terms of end-user perceptual quality.
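
For packet-level FEC, the loss side of the rate-quality trade-off can be sketched under two assumptions: an RS(n, k)-style erasure code, and independent packet losses with probability p. A block fails only if more than n - k of its n packets are lost. `quality_of_rate` and `quality_on_loss` in `choose_k` are hypothetical stand-ins for the paper's perceptual metric.

```python
import math


def block_failure_prob(n, k, p):
    """P(an (n, k) erasure-coded block is unrecoverable) under i.i.d. loss p.

    The block is recoverable if at least k of its n packets arrive, so it
    fails when more than n - k packets are lost.
    """
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(n - k + 1, n + 1))


def choose_k(n, p, quality_of_rate, quality_on_loss):
    """Pick the number of source packets k within a fixed budget of n packets.

    quality_of_rate(k): hypothetical quality when the block arrives intact
    quality_on_loss:    hypothetical quality when the block is lost
    """
    def expected(k):
        pf = block_failure_prob(n, k, p)
        return (1 - pf) * quality_of_rate(k) + pf * quality_on_loss
    return max(range(1, n + 1), key=expected)
```

With a fixed bandwidth, every parity packet is a source packet given up, so the optimum balances encoding quality against residual loss; using a perceptual rather than a PSNR-based `quality_of_rate` is what shifts that optimum in the paper's approach.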

  • A novel blind digital watermarking algorithm based on quantization for Jiaguwen rubbing image in Contourlet domain

    Publication Year: 2010 , Page(s): 103 - 107

    In this paper, we present a novel blind digital watermarking algorithm for Jiaguwen rubbing images that uses quantization in the Contourlet transform domain. The Contourlet transform decomposes the original image into a series of multiscale, local, and directional subimages, and the low-frequency subband is selected as the watermark embedding domain. The original binary watermark is first scrambled by the two-dimensional Arnold transform and then embedded at the selected points by modifying the low-frequency subband coefficients through quantization. Watermark extraction is a blind process that does not need the original image. The experimental results show that the proposed algorithm resists attacks such as JPEG compression, noise addition, and cropping, and that the watermark is invisible and robust.
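
The scrambling and embedding steps can be sketched as follows, assuming the low-frequency Contourlet coefficients are already available as a list (the Contourlet transform itself is not shown). The Arnold map used is (x, y) -> (x + y, x + 2y) mod N on an N x N watermark, and the quantization rule is a simple parity-based quantization index modulation with a hypothetical step `delta`; the paper's exact quantizer is not given in the abstract.

```python
def arnold(img):
    """One iteration of the Arnold cat map on an N x N array."""
    n = len(img)
    out = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            out[(x + y) % n][(x + 2 * y) % n] = img[x][y]
    return out


def arnold_inverse(scrambled):
    """Undo one Arnold iteration by reading back from the mapped positions."""
    n = len(scrambled)
    return [[scrambled[(x + y) % n][(x + 2 * y) % n] for y in range(n)]
            for x in range(n)]


def qim_embed(c, bit, delta):
    """Quantize coefficient c to a multiple of delta whose index parity equals bit."""
    k = round(c / delta)
    if k % 2 != bit:
        # Move to the neighbouring level on the side of c.
        k = k + 1 if c / delta > k else k - 1
    return k * delta


def qim_extract(c, delta):
    """Blind extraction: the parity of the nearest quantization index."""
    return round(c / delta) % 2
```

Extraction needs only `delta`, not the original image, and survives any perturbation smaller than `delta / 2`; a larger step gives more robustness at the cost of more visible distortion.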

  • An enhanced screen codec for live lecture broadcasting

    Publication Year: 2010 , Page(s): 108 - 114
    Cited by:  Papers (3)

    In this paper, we present the design of SJSC, an enhanced screen codec we propose for live lecture broadcasting. Compared with other available screen codecs, SJSC is better suited to live broadcasting of lectures over an unreliable transport medium such as the Internet, owing to two features. First, an error-resilience mechanism limits the impact of packet loss on the decoded screen sequence. Second, a unique frame-rate self-adaptation mechanism caps the bit rate of the encoded stream at a level the underlying transport can sustain. SJSC has been successfully deployed in our P2P-streaming-based live lecture broadcast system, PPClass, which has been in real use at our college for more than 6 months and has served 300+ classes for a cumulative base of 9000+ remote students. Both subjective assessment by real students and objective evaluation experiments have demonstrated the effectiveness of its design.

  • A novel ρ-q/Distortion model for H.264 rate control algorithm

    Publication Year: 2010 , Page(s): 115 - 119

    Rate control aims to maximize reconstructed video quality while adapting the coding bit rate to fluctuating network bandwidth. This paper proposes a simple rate control algorithm for H.264 video coding based on a novel ρ-q/Distortion model, whose fitted curve shows a good goodness-of-fit. Experimental results demonstrate that, compared with the rate control mechanism suggested in the H.264 reference software, the proposed model provides better reconstructed visual quality, with a coding gain of up to 0.37 dB, as well as a much smaller bit rate control error, with a maximum bit rate reduction of 2.04 kbps.

  • A new scheme of layered coding of stereoscopic video

    Publication Year: 2010 , Page(s): 120 - 124

    A new scheme for layered coding of stereoscopic video based on the correlation of motion vector fields (MV fields) is proposed in this paper. It improves the robustness of the bit streams over error-prone channels without noticeably reducing coding efficiency. In the proposed scheme, the stereoscopic video is divided into one base layer and two enhancement layers. First, the left-view video is encoded as the Base Layer to ensure the basic quality of the two-dimensional video. Then, the disparity vectors (DVs) of the right-view video are encoded as Enhancement Layer 1. To further lower the bit rate and reduce network load, the correlation between the motion vectors (MVs) of the two views is tested. The MVs of the right view are transmitted as Enhancement Layer 2 only if the correlation between the left and right views is below a threshold. Experimental results demonstrate the efficiency of the proposed scheme.
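
The Enhancement Layer 2 decision can be sketched as a correlation test between the two views' motion-vector fields. This assumes the fields are aligned block by block and that Pearson correlation over the pooled x and y components is the measure; both the measure and the hypothetical `threshold` value are assumptions, since the abstract does not define them.

```python
import math


def mv_correlation(left_mvs, right_mvs):
    """Pearson correlation between two aligned motion-vector fields.

    Each field is a list of (dx, dy) vectors for corresponding blocks;
    the x and y components are pooled into one sample.
    """
    a = [c for mv in left_mvs for c in mv]
    b = [c for mv in right_mvs for c in mv]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        # Degenerate constant field: treat identical fields as fully correlated.
        return 1.0 if a == b else 0.0
    return sab / math.sqrt(saa * sbb)


def send_right_view_mvs(left_mvs, right_mvs, threshold=0.9):
    """Send the right view's MVs (Enhancement Layer 2) only when correlation is low."""
    return mv_correlation(left_mvs, right_mvs) < threshold
```

When the test says the fields are strongly correlated, the decoder can reuse the left view's MVs for the right view, which is where the bit-rate saving comes from.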

  • CUDA-based directional image/video interpolation

    Publication Year: 2010 , Page(s): 125 - 129

    Image/video scaling is one of the most important applications in image processing and video post-processing. At present, real-time applications are normally based on simple but poor-quality interpolation algorithms. High-quality real-time image/video scaling therefore requires a fast interpolation algorithm. In this paper, we use CUDA (Compute Unified Device Architecture) to accelerate the existing NEDI (New Edge-Directed Interpolation) algorithm to meet the requirements of both real-time processing and scaling quality. Experimental results show that the CUDA-based implementation of NEDI can be more than 100 times faster than the original implementation.

  • Adding 3D sound to 3D cinema: Identification and evaluation of different reproduction techniques

    Publication Year: 2010 , Page(s): 130 - 137
    Cited by:  Papers (1)

    Very little research has been conducted so far into the general problem of producing a 3D soundscape consistent with the visual content of a 3D-stereoscopic movie. First, the following 3D sound reproduction techniques are reviewed: Vector Base Amplitude Panning (VBAP), binaural and transaural techniques, Wave Field Synthesis (WFS), and Ambisonics. Second, the new challenges of 3D cinema are introduced. Third, we reconsider each 3D audio technique in light of these challenges. We find that, at least in theory, a completely personalized soundscape is needed and that, for various technical reasons, only binaural reproduction through headphones can accurately produce a 3D soundscape consistent with a 3D-stereo movie in a theater environment.
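
Of the reviewed techniques, VBAP is the simplest to illustrate. The sketch below uses its 2-D pairwise form: the source direction p is written as a weighted sum of the two loudspeaker unit vectors, p = g1 l1 + g2 l2. The equation is solved for the gains, which are then normalized to constant power. Angles are in degrees in the horizontal plane, and the function name is illustrative.

```python
import math


def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Constant-power VBAP gains (g1, g2) for one loudspeaker pair."""
    def unit(deg):
        r = math.radians(deg)
        return math.cos(r), math.sin(r)

    px, py = unit(source_deg)
    (ax, ay), (bx, by) = unit(spk1_deg), unit(spk2_deg)
    det = ax * by - bx * ay
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker directions must not be collinear")
    # Cramer's rule for g1 * l1 + g2 * l2 = p.
    g1 = (px * by - bx * py) / det
    g2 = (ax * py - px * ay) / det
    # Normalize so that g1**2 + g2**2 == 1 (constant perceived loudness).
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

A source exactly at one loudspeaker gets all of the gain, and a source midway between a symmetric pair gets equal gains of about 0.707 each. A negative gain signals that the source lies outside the pair, so another pair should be used. Because the panned image is only correct near the "sweet spot", this limitation relates to the abstract's conclusion about theater audiences.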
