IEEE Transactions on Multimedia

Issue 3 • June 2006

  • Table of contents

    Page(s): c1 - c4
  • IEEE Transactions on Multimedia publication information

    Page(s): c2
  • Semantic adaptation of sport videos with user-centred performance analysis

    Page(s): 433 - 443

    In semantic video adaptation, measures of performance must consider the impact of errors in the automatic annotation on the adaptation, in relation to the preferences and expectations of the user. In this paper, we define two new performance measures, Viewing Quality Loss and Bit-rate Cost Increase, that are obtained from the classical peak signal-to-noise ratio (PSNR) and bit rate, and relate the results of semantic adaptation to the errors in the annotation of events and objects and to the user's preferences and expectations. We present and discuss results obtained with a system that performs automatic annotation of soccer video highlights and applies different coding strategies to different parts of the video according to their relative importance for the end user. With reference to this framework, we analyze how highlight statistics and the errors of the annotation engine influence the performance of semantic adaptation and are reflected in the quality of the video displayed at the user's client and in the increase of transmission costs.

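    The Viewing Quality Loss measure above is derived from PSNR. For reference, a minimal PSNR computation might look as follows (a generic sketch, not the authors' code; the function name and flat pixel-list interface are illustrative):

```python
import math

def psnr(original, degraded, max_value=255):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    if len(original) != len(degraded):
        raise ValueError("sequences must have equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        return float("inf")  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)
```

    Higher PSNR means the adapted video is closer to the original; the paper's measures relate drops in this quantity to annotation errors.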
  • Deep compression of remotely rendered views

    Page(s): 444 - 456

    Three-dimensional (3-D) models are information-rich and provide compelling visualization effects. However, downloading and viewing 3-D scenes over the network may be excessive. In addition, low-end devices typically have insufficient power and/or memory to render the scene interactively in real time. Alternatively, 3-D image warping, an image-based rendering technique that warps a two-dimensional (2-D) depth view to form new views from different viewpoints and/or orientations, may be employed on a limited device. In a networked 3-D environment, the warped views may be further compensated by the graphically rendered views and transmitted to clients at times. Depth views can be considered a compact model of 3-D scenes, enabling the remote rendering of complex 3-D environments on relatively low-end devices. The major overhead of the 3-D image warping environment is the transmission of the depth views of the initial and subsequent references. This paper addresses the issue by presenting an effective remote rendering environment based on the deep compression of depth views, exploiting the context statistics structure present in them. The warped image quality is also explored by reducing the resolution of the depth map. It is shown that the proposed deep compression of the remotely rendered view significantly outperforms JPEG2000 and enables the real-time rendering of remote 3-D scenes, while the degradation of warped image quality remains visually imperceptible for the benchmark scenes.

  • An efficient scheme for motion estimation using multireference frames in H.264/AVC

    Page(s): 457 - 466

    Multiple reference frame motion compensation (MRMC), supported by H.264, exploits the redundancy between multiple frames to enhance coding efficiency over single reference frame motion compensation (SRMC), in which motion vectors are searched over a single reference frame. Moreover, the use of multiple reference frames can combat channel errors efficiently. However, searching for motion vectors in multiple frames may require a huge computing time. This paper proposes a novel motion estimation procedure with lower search complexity and no sacrifice in image quality. To reduce the complexity of the motion estimation procedure, we use a temporary motion vector generated with little computation. The temporary motion vector is calculated from a motion vector map composed of motion vectors between successive frames, and is used to predict the optimal motion vector for a reference frame. The proposed scheme requires lower complexity than conventional schemes by using the temporary motion vector and a refinement process over a narrow search range around the temporary predictive motion vector. Since the temporary predictive motion vector effectively tracks the optimal motion vector for each reference frame, the image quality encoded by the proposed scheme is very similar to that of the full search algorithm. The proposed motion estimation process consists of three phases: 1) making a vector map between two consecutive frames, constructed by copying the motion vectors estimated in the first reference frame; 2) composing a temporary motion vector from element vectors in the vector map; and 3) refining the temporary predictive motion vector over a narrow search range. We show experimental results that demonstrate the effectiveness of the proposed method. To compare the proposed motion estimation algorithm with conventional schemes, we measure the CPU time consumed by the motion estimation module in an H.264 encoder using the proposed scheme. In the results, the CPU time consumed by the proposed scheme is reduced significantly without additional distortion of the encoded video quality.

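    The three-phase procedure in the abstract (vector map, composition, refinement) can be caricatured in a few lines. This is a loose sketch under my own simplifications, treating motion vectors as 2-D integer tuples and omitting the actual block-matching cost evaluation:

```python
def compose_motion_vector(per_frame_vectors):
    """Compose a temporary motion vector toward a distant reference frame
    by accumulating the motion vectors between successive frames."""
    return (sum(v[0] for v in per_frame_vectors),
            sum(v[1] for v in per_frame_vectors))

def refinement_candidates(temporary_mv, search_range=2):
    """Enumerate positions in a narrow window around the temporary vector;
    a real encoder would evaluate a block-matching cost at each one."""
    mx, my = temporary_mv
    return [(mx + dx, my + dy)
            for dx in range(-search_range, search_range + 1)
            for dy in range(-search_range, search_range + 1)]
```

    The point of the scheme is that the refinement window can be much smaller than a full search range over every reference frame.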
  • Low-delay rate control for real-time H.264/AVC video coding

    Page(s): 467 - 477

    This paper presents an efficient rate control scheme for H.264/AVC video coding in low-delay environments. In our scheme, we propose an enhancement to the buffer-status-based H.264/AVC bit allocation method: a PSNR-based frame complexity estimation that improves the existing mean-absolute-difference-based (MAD-based) complexity measure. Bit allocation to each frame is not only computed from the encoder buffer status but also adjusted by a combined frame complexity measure. To prevent the buffer from undesirable overflow or underflow under a small buffer size constraint in low-delay environments, the computed quantization parameter (QP) for the current MB is adjusted based on the actual encoding results up to that point. We also propose to compare the bits produced by each mode with the average target bits per MB to dynamically modify the Lagrange multiplier (λMODE) for mode decision. The objective of the QP and λMODE adjustments is to produce bits as close to the frame target as possible, which is especially important for low-delay applications. Simulation results show that the H.264 coder using our proposed scheme obtains significant improvement in the mismatch ratio of target bits and actual bits in all test cases, achieves a visual quality improvement of about 0.6 dB on average, performs better with respect to buffer overflow and underflow, and achieves a similar or smaller PSNR deviation.

  • Fast multiframe motion estimation algorithms by motion vector composition for the MPEG-4/AVC/H.264 standard

    Page(s): 478 - 487

    The MPEG-4/AVC/H.264 video coding standard adopts various coding schemes, such as multiple reference frames and variable block sizes, for motion estimation. Hence, MPEG-4/AVC/H.264 provides gains in compression efficiency of up to 50% over a wide range of bit rates and video resolutions compared to previous standards. However, these features result in a considerable increase in encoder complexity, mainly in mode decision and motion estimation. The proposed algorithms use the stored motion vectors to compose the motion vector without performing a full search in each reference frame. Therefore, the proposed algorithms obtain an average speed-up ratio of four for encoding, benefiting from the prediction of the motion vector for the reference frames in advance while maintaining good performance. Any fast search algorithm can be utilized to further reduce the computational load.

  • A novel fractal image watermarking

    Page(s): 488 - 499

    A novel watermarking method is proposed to hide a binary watermark in image files compressed by fractal block coding. This watermarking method utilizes a special type of orthogonalization fractal coding in which the fractal affine transform is determined by the range block mean and contrast scaling. Such orthogonalization fractal decoding is a mean-invariant iteration. In contrast, the fractal parameters of classical fractal compression are very sensitive to any change of the domain block pool and to common signal and geometric distortions; hence, it is impossible to place a watermark directly in the fractal parameters. The proposed watermark embedding procedure inserts a permuted pseudo-random binary sequence into the quantized range block means. The watermark is detected by computing the correlation coefficient between the original and the extracted watermark. Experimental results show that the proposed fractal watermarking scheme is robust against common signal and geometric distortions such as JPEG compression, low-pass filtering, rescaling, and clipping.

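    Watermark detection by correlation, as described in the abstract, is a standard step; a generic sketch (function names and the 0.5 threshold are illustrative, not taken from the paper):

```python
import math

def correlation(w_orig, w_extracted):
    """Pearson correlation coefficient between two watermark sequences."""
    n = len(w_orig)
    mo = sum(w_orig) / n
    me = sum(w_extracted) / n
    cov = sum((a - mo) * (b - me) for a, b in zip(w_orig, w_extracted))
    var_o = sum((a - mo) ** 2 for a in w_orig)
    var_e = sum((b - me) ** 2 for b in w_extracted)
    return cov / math.sqrt(var_o * var_e)

def watermark_present(w_orig, w_extracted, threshold=0.5):
    """Declare the watermark detected when correlation exceeds a threshold."""
    return correlation(w_orig, w_extracted) > threshold
```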
  • Recognition of facial expressions and measurement of levels of interest from video

    Page(s): 500 - 508

    This paper presents a spatio-temporal approach to recognizing six universal facial expressions from visual data and using them to compute levels of interest. The classification approach relies on a two-step strategy on top of projected facial motion vectors obtained from video sequences of facial expressions. First, a bank of linear classifiers was applied to projected optical flow vectors, and the decisions made by the linear classifiers were coalesced to produce a characteristic signature for each universal facial expression. The signatures computed from the training data set were used to train discrete hidden Markov models (HMMs) to learn the underlying model for each facial expression. The performance of the proposed facial expression recognition was computed using fivefold cross-validation on the Cohn-Kanade facial expression database, consisting of 488 video sequences of 97 subjects. The proposed approach achieved an average recognition rate of 90.9% on this database. Recognized facial expressions were mapped to levels of interest using the affect space and the intensity of motion around the apex frame. The computed level of interest was subjectively analyzed and found to be consistent with "ground truth" information in most cases. To further illustrate the efficacy of the proposed approach, and to better understand the effects of a number of factors that are detrimental to facial expression recognition, additional experiments were conducted. The first empirical analysis was conducted on a database of 108 facial expressions collected from TV broadcasts and labeled by human coders for subsequent analysis. The second experiment (emotion elicitation) was conducted on facial expressions obtained from 21 subjects by showing them six different movie clips chosen to arouse spontaneous emotional reactions that would produce natural facial expressions.

  • Modeling individual and group actions in meetings with layered HMMs

    Page(s): 509 - 520

    We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay) and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process: one that models basic individual activities from low-level audio-visual (AV) features, and another that models the interactions. We propose a two-layer hidden Markov model (HMM) framework that implements this concept in a principled manner and has advantages over previous work. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMMs can be used in each layer to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public 5-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.

  • Merging artificial objects with marker-less video sequences based on the interacting multiple model method

    Page(s): 521 - 528

    Inserting synthetic objects into video sequences has gained much interest in recent years. Fast and robust vision-based algorithms are necessary to make such an application possible. Traditional pose tracking schemes using recursive structure-from-motion techniques adopt one Kalman filter and thus favor only a certain type of camera motion. We propose a robust simultaneous pose tracking and structure recovery algorithm using the interacting multiple model (IMM) to improve performance. In particular, a set of three extended Kalman filters (EKFs), each describing a frequently occurring camera motion in real situations (general, pure translation, pure rotation), is applied within the IMM framework to track the pose of a scene. Another set of EKFs, one filter for each model point, is used to refine the positions of the model features in 3-D space. The filters for pose tracking and structure refinement are executed in an interleaved manner. The results are used for inserting virtual objects into the original video footage. The performance of the algorithm is demonstrated with both synthetic and real data. Comparisons with different approaches show that our method is more efficient and accurate.

  • Interactive dialogue model: a design technique for multichannel applications

    Page(s): 529 - 541

    Multichannel applications deliver the same content and a "similar interactive experience" using different devices and different technologies (e.g., web sites, handheld devices, car navigators, or interactive TVs). Various channels imply a number of differences, including screen size, keyboard size, pointing devices, output devices, performance, and the context of use (standing, sitting, walking, etc.). Today, in most cases, applications for different channels are designed and implemented almost "independently," with inefficiency for the developers (high costs) and ineffectiveness for the users (loss of consistency across the different channels and the perception that they are "different applications"). This paper presents the interactive dialogue model (IDM), a novel design model specifically tailored for multichannel applications. The background research, moving from linguistic theories and practices, has led us to the development of a "channel-independent" design model (based on dialogue primitives). Design can start in a "conceptual," channel-independent fashion, and then proceed into a further "logical" design oriented toward specific channels of communication. Designing an interactive application in two steps (channel-independent first, channel-dependent later) offers a number of advantages without making the overall design process more cumbersome. Besides the emphasis on multichannel delivery, IDM has additional distinctive features: it is lightweight, providing a small set of primitives (and a simple graphic notation) that are easy to learn and teach. Moreover, it is suitable for brainstorming and generating ideas at an early stage of design (or during the shift from requirements to design); finally, it is cost-effective (it requires little effort from designers) and modular (designers can take the parts they wish, not being forced into "all or nothing"). IDM has been validated in both academic and industry environments, providing excellent results so far.

  • Learning dynamic audio-visual mapping with input-output Hidden Markov models

    Page(s): 542 - 549

    In this paper, we formulate the problem of synthesizing facial animation from an input audio sequence as a dynamic audio-visual mapping. We propose that audio-visual mapping should be modeled with an input-output hidden Markov model (IOHMM), an HMM for which the output and transition probabilities are conditional on the input sequence. We train IOHMMs using the expectation-maximization (EM) algorithm with a novel architecture that explicitly models the relationship between transition probabilities and the input using neural networks. Given an input sequence, the output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs can generate natural and good-quality facial animation sequences from the input audio.

  • Adaptive online transmission of 3-D TexMesh using scale-space and visual perception analysis

    Page(s): 550 - 563

    Efficient online visualization of a three-dimensional (3-D) mesh, mapped with photo-realistic texture, is essential for a variety of applications such as museum exhibits and medical imaging. In these applications, synthetic texture or color per vertex loses authenticity and resolution, and an image-based view-dependent approach requires too much overhead to generate a 360° display for online applications. We propose using a mesh simplification algorithm based on scale-space analysis of the feature point distribution, combined with an associated visual perception analysis of the surface texture, to address the needs of adaptive online transmission of high-quality 3-D objects. The premise of the proposed textured mesh (TexMesh) simplification, which takes the human visual system into consideration, is the following: given limited bandwidth, texture quality on low feature-density surfaces can be reduced without significantly affecting human perception. The advantage of allocating higher bandwidth, and thus higher quality, to dense feature-density surfaces is improved overall visual fidelity. Statistics on the feature point distribution and the associated texture fragments are gathered during preprocessing. Online transmission is based on these statistics, which can be retrieved in constant time. Using an initial estimated bandwidth, a scaled mesh is first transmitted. Starting from a default texture quality, we apply an efficient Harmonic Time Compensation Algorithm, based on the current bandwidth and a time limit, to adaptively adjust the texture quality of the next fragment to be transmitted. Properties of the algorithm are proved, and experimental results show the usefulness of our approach.

  • Toward intelligent music information retrieval

    Page(s): 564 - 574

    Efficient and intelligent music information retrieval is a very important topic of the 21st century. With the ultimate goal of building personal music information retrieval systems, this paper studies the problem of intelligent music information retrieval. Huron points out that since the preeminent functions of music are social and psychological, the most useful characterization would be based on four types of information: genre, emotion, style, and similarity. This paper introduces Daubechies Wavelet Coefficient Histograms (DWCH) for music feature extraction in music information retrieval. The histograms are computed from the coefficients of the db8 Daubechies wavelet filter applied to 3 s of music. A comparative study of sound features and classification algorithms on a dataset compiled by Tzanetakis shows that combining DWCH with timbral features (MFCC and FFT), using multiclass extensions of the support vector machine, achieves approximately 80% accuracy, a significant improvement over the previously known result on this dataset. On another dataset the combination achieves 75% accuracy. The paper also studies the issue of detecting emotion in music; ratings from two subjects on three bipolar adjective pairs are used, and an accuracy of around 70% was achieved in predicting emotional labeling in these adjective pairs. The paper further studies the problem of identifying groups of artists based on their lyrics and sound using a semi-supervised classification algorithm; identification of artist groups based on the Similar Artist lists at All Music Guide is attempted, and the semi-supervised learning algorithm resulted in nontrivial increases in accuracy, to more than 70%. Finally, the paper conducts a proof-of-concept experiment on similarity search using the feature set.

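    To illustrate the flavor of a wavelet-coefficient histogram feature, here is a toy sketch using a single Haar decomposition step instead of the db8 filter the paper actually uses; the bin count and value range are arbitrary choices of mine:

```python
def haar_step(signal):
    """One level of the Haar wavelet transform: (approximation, detail)."""
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(len(signal) // 2)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(len(signal) // 2)]
    return approx, detail

def coefficient_histogram(coeffs, n_bins=8, lo=-1.0, hi=1.0):
    """Histogram of wavelet coefficients, the core of a DWCH-style feature."""
    hist = [0] * n_bins
    width = (hi - lo) / n_bins
    for c in coeffs:
        idx = min(n_bins - 1, max(0, int((c - lo) / width)))
        hist[idx] += 1
    return hist
```

    The paper then summarizes such histograms (e.g., by subband statistics) and feeds them, together with timbral features, to a classifier.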
  • Precise pitch profile feature extraction from musical audio for key detection

    Page(s): 575 - 584

    The majority of pieces of music, including classical and popular music, are composed using musical scales and keys. The key or scale information of a piece provides important clues to its high-level musical content, such as harmonic and melodic context. Automatic key detection from music data can be useful for music classification, retrieval, or further content analysis. Many researchers have addressed key finding from symbolically encoded music (MIDI); however, work on key detection in musical audio is still limited. Techniques for key detection from musical audio mainly consist of two steps: pitch extraction and key detection. The pitch feature typically characterizes the weights of presence of particular pitch classes in the music audio. In existing approaches to pitch extraction, little consideration has been given to pitch mistuning and the interference of noisy percussion sounds in the audio signals, which inevitably affect the accuracy of key detection. In this paper, we present a novel technique for precise pitch profile feature extraction that deals with pitch mistuning and noisy percussive sounds. The extracted pitch profile feature characterizes the pitch content in the signal more accurately than previous techniques, thus leading to higher key detection accuracy. Experiments based on classical and popular music data were conducted. The results show that the proposed method has higher key detection accuracy than previous methods, especially for popular music with many noisy drum sounds.

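    A bare-bones pitch-class profile, the kind of feature this paper refines, can be sketched as below. The round() call crudely snaps mistuned frequencies to the nearest semitone; the paper's mistuning handling is considerably more careful, and the names here are my own:

```python
import math

def pitch_class(freq_hz, ref_a4=440.0):
    """Map a frequency to one of 12 pitch classes (0 = C, 9 = A)."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / ref_a4))
    return (semitones_from_a4 + 9) % 12  # A4 sits at pitch class 9

def pitch_profile(peaks):
    """Accumulate spectral peak weights into a 12-bin pitch profile."""
    profile = [0.0] * 12
    for freq, weight in peaks:
        profile[pitch_class(freq)] += weight
    return profile
```

    Key detection then compares such a profile against key templates; noisy percussive peaks would pollute the bins, which motivates the paper's filtering.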
  • Fusion of audio and motion information on HMM-based highlight extraction for baseball games

    Page(s): 585 - 599

    This paper aims to extract baseball game highlights based on audio-motion integrated cues. In order to better describe different audio and motion characteristics in baseball game highlights, we propose a novel representation method based on likelihood models. The proposed likelihood models measure the "likeliness" of low-level audio features and motion features to a set of predefined audio types and motion categories, respectively. Our experiments show that using the proposed likelihood representation is more robust than using low-level audio/motion features to extract the highlight. With the proposed likelihood models, we then construct an integrated feature representation by symmetrically fusing the audio and motion likelihood models. Finally, we employ a hidden Markov model (HMM) to model and detect the transition of the integrated representation for highlight segments. A series of experiments have been conducted on a 12-h video database to demonstrate the effectiveness of our proposed method and show that the proposed framework achieves promising results over a variety of baseball game sequences.

  • TCP smoothness and window adjustment strategy

    Page(s): 600 - 609

    We observe that even when system throughput is relatively stable, end users of media-streaming applications do not necessarily experience smooth throughput, due to the unsynchronized window adjustments triggered by random congestion indications. We analyze and evaluate the negative impact of random window adjustments on smoothness, short-term fairness, and long-term fairness. We further propose an experimental congestion avoidance mechanism, namely TCP(α, β, γ, δ), based on coordinated window adjustments. Flow-level smoothness is enhanced significantly for media-streaming applications, without cost to fairness and responsiveness; responsiveness is even boosted when bandwidth is underutilized.

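    The TCP(α, β, γ, δ) family generalizes the standard additive-increase/multiplicative-decrease (AIMD) window rule. A minimal AIMD sketch for orientation (parameter values and function names are illustrative, not the paper's mechanism):

```python
def aimd_step(cwnd, congested, alpha=1.0, beta=0.5):
    """One congestion-window update: additive increase by alpha per round,
    multiplicative decrease by factor beta on a congestion indication."""
    if congested:
        return max(1.0, cwnd * beta)
    return cwnd + alpha

def simulate(congestion_events, cwnd=1.0):
    """Trace the window over a sequence of per-round congestion flags."""
    trace = [cwnd]
    for congested in congestion_events:
        cwnd = aimd_step(cwnd, congested)
        trace.append(cwnd)
    return trace
```

    Unsynchronized flows see different congestion flags and therefore saw-tooth out of phase, which is precisely the flow-level smoothness problem the paper targets with coordinated adjustments.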
  • Service differentiated peer selection: an incentive mechanism for peer-to-peer media streaming

    Page(s): 610 - 621

    We propose a service-differentiated peer selection mechanism for peer-to-peer media streaming systems. The mechanism provides flexibility and choice in peer selection to the contributors of the system, resulting in high-quality streaming sessions. Free-riders are given limited options in peer selection, if any, and hence receive low-quality streaming. The proposed incentive mechanism follows the characteristics of rank-order tournament theory, which considers only the relative performance of the players, with the top prizes awarded to the winners of the tournament. Using rank-order tournaments, we analyze the behavior of utility-maximizing users. Through simulation and wide-area measurement studies, we verify that the proposed incentive mechanism can provide near-optimal streaming quality to cooperative users until the bottleneck shifts from the streaming sources to the network.

  • Adding lossless video compression to MPEGs

    Page(s): 622 - 625

    In this correspondence, we propose to add a lossless compression functionality to the existing MPEG standards by developing a new context tree to drive arithmetic coding for lossless video compression. In comparison with existing work on context tree design, the proposed algorithm features: 1) prefix sequence matching to locate the statistics model at the internal node nearest to the stopping point, where the successful match of the context sequence is broken; 2) traversal of the context tree along a fixed order of context structure with a maximum of four motion-compensated errors; and 3) context thresholding to quantize the higher end of error values into a single statistics cluster. As a result, the proposed algorithm achieves competitive processing speed, low computational complexity, and high compression performance, bridging the gap between universal statistics modeling and practical compression techniques. Extensive experiments show that the proposed algorithm outperforms JPEG-LS by up to 24% and CALIC by up to 22%, with processing times ranging from less than 2 seconds to 6 seconds per frame on a typical PC.

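    As a rough illustration of the context-thresholding idea in the abstract, a sketch that folds the high end of error magnitudes into one cluster so the context model keeps few, well-populated symbols (the threshold value and function name are arbitrary assumptions of mine):

```python
def quantize_high_errors(errors, threshold=8):
    """Clamp motion-compensated error magnitudes: values at or above the
    threshold all map to one symbol, forming a single statistics cluster."""
    return [min(abs(e), threshold) for e in errors]
```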
  • Nonlinear collusion attack on a watermarking scheme for buyer authentication

    Page(s): 626 - 629

    This paper presents an adaptive collusion attack on a buyer authentication watermarking scheme. To accomplish this attack, the traitors (i.e., dishonest buyers) select the pixels of their watermarked images generated from the same original image and average the selected pixels so as to remove the watermark information. Additionally, the forged image is of higher quality than any watermarked image. Both theoretical and experimental results demonstrate that our attack is very effective.

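    The core of an averaging collusion attack is simple; a plain (non-adaptive) sketch for intuition, whereas the attack in the paper additionally selects which pixels to combine:

```python
def average_collusion(watermarked_copies):
    """Average corresponding pixels across the colluders' copies; distinct
    per-buyer watermark signals tend to cancel toward their mean."""
    n = len(watermarked_copies)
    return [sum(pixels) / n for pixels in zip(*watermarked_copies)]
```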
  • Cryptanalysis of Chu's DCT based watermarking scheme

    Page(s): 629 - 632

    In 2003, Chu proposed an oblivious watermarking algorithm by modifying the scheme proposed by Cox, Kilian, Leighton, and Shamoon in 1997, known as the CKLS scheme. In this correspondence, we report that Chu's modification is susceptible to a suitably adapted version of the attack devised by Das and Maitra in 2004. In fact, the experimental results show that Chu's scheme is even weaker than the CKLS scheme against our attack.

  • An effective music information retrieval method using three-dimensional continuous DP

    Page(s): 633 - 639

    This paper describes a music information retrieval system that uses humming as the key for retrieval. Humming is an easy way for a user to input a melody, but several problems with humming degrade retrieval. One problem is the human factor: sometimes people do not sing accurately, especially if they are inexperienced or unaccompanied. Another problem arises from signal processing. Therefore, a music information retrieval method should be sufficiently robust to surmount various humming errors and signal processing problems. A retrieval system has to extract the pitch from the user's humming; however, pitch extraction is not perfect. It often captures half or double pitches, which are harmonic frequencies of the true pitch, even if the extraction algorithms take the continuity of the pitch into account. Considering these problems, we propose a system that takes multiple pitch candidates into account. In addition to the frequencies of the pitch candidates, the confidence measures obtained from their powers are taken into consideration. We also propose an algorithm with three dimensions, an extension of the conventional Dynamic Programming (DP) algorithm, so that multiple pitch candidates can be treated. Moreover, in the proposed algorithm, DP paths are changed dynamically to take the delta pitches and IOI (inter-onset interval) ratios of input and reference notes into account, in order to treat notes being split or unified. We carried out an evaluation experiment to compare the proposed system with a conventional system. When using three pitch candidates with the confidence measure and IOI features, the top-ten retrieval accuracy was 94.1%. Thus, the proposed method gave better retrieval performance than the conventional system.

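    A conventional one-dimensional DP melody match, which the paper's three-dimensional continuous DP extends to multiple pitch candidates, can be sketched as an edit-distance-style recurrence (the gap cost and interface are illustrative assumptions; pitches are MIDI note numbers):

```python
def melody_distance(query, reference, gap_cost=2.0):
    """Edit-distance-style DP between two pitch sequences: substitution
    costs the absolute pitch difference, insertion/deletion a fixed gap."""
    m, n = len(query), len(reference)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * gap_cost
    for j in range(1, n + 1):
        d[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = d[i - 1][j - 1] + abs(query[i - 1] - reference[j - 1])
            d[i][j] = min(match, d[i - 1][j] + gap_cost, d[i][j - 1] + gap_cost)
    return d[m][n]
```

    The gap transitions allow split or unified notes to align, which is the role the paper's dynamically changed DP paths play in a more refined form.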
  • IEEE Transactions on Multimedia Edics

    Page(s): 640
  • IEEE Transactions on Multimedia information for authors

    Page(s): 641 - 642

Aims & Scope

The scope of this periodical covers the various aspects of research in multimedia technology and applications of multimedia.


Meet Our Editors

Editor-in-Chief
Chang Wen Chen
State University of New York at Buffalo