IEEE Transactions on Multimedia

Issue 3 • June 2011

  • Table of contents

    Page(s): C1 - C4
    PDF (49 KB)
    Freely Available from IEEE
  • IEEE Transactions on Multimedia publication information

    Page(s): C2
    PDF (37 KB)
    Freely Available from IEEE
  • Introduction to the ICME2010 Special Issue

    Page(s): 417 - 420
    PDF (1187 KB)
    Freely Available from IEEE
  • Moving Region Segmentation From Compressed Video Using Global Motion Estimation and Markov Random Fields

    Page(s): 421 - 431
    PDF (2273 KB) | HTML

    In this paper, we propose an unsupervised segmentation algorithm for extracting moving regions from compressed video using global motion estimation (GME) and Markov random field (MRF) classification. First, motion vectors (MVs) are compensated for global motion and quantized into several representative classes, from which MRF priors are estimated. Then, a coarse segmentation map of the MV field is obtained using a maximum a posteriori estimate of the MRF label process. Finally, the boundaries of segmented moving regions are refined using color and edge information. The algorithm has been validated on a number of test sequences, and experimental results are provided to demonstrate its advantages over state-of-the-art methods.
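
    As a rough illustration of the MAP-MRF labeling step, here is an ICM-style update with a Potts smoothness prior over a field of quantized MV class labels. The data term and the prior weight are simplifications invented for this sketch, not the authors' estimated MRF model.

    ```python
    import numpy as np

    def icm_mrf_labels(mv_classes, n_classes, beta=1.5, n_iters=5):
        """Coarse MAP-MRF segmentation of a quantized motion-vector field.

        mv_classes : 2-D int array of per-block MV class labels; a 0/1
                     disagreement cost stands in for the paper's data term.
        beta       : Potts smoothness weight (assumed value).
        """
        labels = mv_classes.copy()
        H, W = labels.shape
        for _ in range(n_iters):
            for y in range(H):
                for x in range(W):
                    costs = np.zeros(n_classes)
                    for c in range(n_classes):
                        data = 0.0 if c == mv_classes[y, x] else 1.0
                        smooth = sum(
                            beta
                            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                            if 0 <= y + dy < H and 0 <= x + dx < W
                            and labels[y + dy, x + dx] != c
                        )
                        costs[c] = data + smooth
                    labels[y, x] = int(np.argmin(costs))
        return labels
    ```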

  • Event-Based Semantic Image Adaptation for User-Centric Mobile Display Devices

    Page(s): 432 - 442
    PDF (1045 KB) | HTML

    This paper proposes a semantic image adaptation scheme for heterogeneous mobile display devices. This scheme aims to provide mobile users with the most desired image content by integrating the content's semantic importance with user preferences under limited mobile display constraints. The main contributions of the proposed scheme are: 1) seamless integration of mobile user-supplied query information with low-level image features to identify semantically important image contents; 2) integration of semantic importance and user feedback to dynamically update mobile user preferences; and 3) perceptually optimized adaptation for image display on mobile devices. In order to bridge the semantic gap for adaptation, we design a Bayesian fusion approach to properly integrate low-level features with high-level semantics. To accommodate the variation of user preferences, the system involves mobile users in the adaptation process with only a few simple feedback actions so as to present the most interesting content on mobile devices. Eventually, perceptually optimized adaptation is performed to present the best image content for mobile users according to mobile display capabilities. Extensive experiments have been carried out based on several common events defined in Kodak's consumer image database. These experiments show that by utilizing the proposed semantic adaptation scheme, with integration of the semantics and mobile user preferences, perceptually relevant adaptation can be effectively carried out to tailor the image towards user intentions under mobile environment constraints.
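
    A minimal way to picture the Bayesian fusion step is a per-region posterior update combining a low-level feature likelihood with a semantic/user-preference prior; the function below is a hypothetical sketch (names, shapes, and the two-hypothesis setup are our assumptions, not the paper's formulation).

    ```python
    import numpy as np

    def fuse_importance(p_feat_given_sem, p_sem_prior):
        """Posterior probability that each region is semantically important.

        p_feat_given_sem : (n_regions, 2) likelihood of each region's
                           low-level features under the 'important' and
                           'not important' hypotheses (assumed inputs).
        p_sem_prior      : (n_regions,) prior importance from the user
                           query / event model.
        """
        prior = np.stack([p_sem_prior, 1.0 - p_sem_prior], axis=1)
        joint = p_feat_given_sem * prior
        posterior = joint / joint.sum(axis=1, keepdims=True)
        return posterior[:, 0]
    ```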

  • Exposing Digital Image Forgeries by Detecting Discrepancies in Motion Blur

    Page(s): 443 - 452
    PDF (1301 KB) | HTML

    The widespread availability of photo manipulation software has made it unprecedentedly easy to manipulate images for malicious purposes. Image splicing is one such form of tampering. In recent years, researchers have proposed various methods for detecting such splicing. In this paper, we present a novel method of detecting splicing in images using discrepancies in motion blur. We estimate motion blur through image gradients in order to detect inconsistencies between the spliced region and the rest of the image. We also develop a new measure to assist in inconsistent region segmentation in images that contain small amounts of motion blur. Experimental results show that our technique provides good segmentation of regions with inconsistent motion blur. We also provide quantitative comparisons with other existing blur-based techniques over a database of images, and our technique gives significantly better detection results.
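
    One common gradient-based way to estimate local blur direction uses the structure tensor of the image gradients: motion blur suppresses gradients along the blur direction, so the eigenvector with the smallest eigenvalue roughly points along it. The block-wise sketch below is a hedged stand-in for the paper's estimator, meant only to show how a spliced region's blur can disagree with the rest of the image.

    ```python
    import numpy as np

    def block_blur_directions(img, block=32):
        """Per-block blur-direction estimates for a 2-D grayscale image."""
        gy, gx = np.gradient(img.astype(float))
        H, W = img.shape
        angles = np.zeros((H // block, W // block))
        for by in range(H // block):
            for bx in range(W // block):
                sy = slice(by * block, (by + 1) * block)
                sx = slice(bx * block, (bx + 1) * block)
                J = np.array([
                    [(gx[sy, sx] ** 2).sum(), (gx[sy, sx] * gy[sy, sx]).sum()],
                    [(gx[sy, sx] * gy[sy, sx]).sum(), (gy[sy, sx] ** 2).sum()],
                ])
                w, v = np.linalg.eigh(J)            # eigenvalues ascending
                angles[by, bx] = np.arctan2(v[1, 0], v[0, 0])
        return angles                               # outlier blocks are suspects
    ```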

  • Depth Image-Based Rendering With Advanced Texture Synthesis for 3-D Video

    Page(s): 453 - 465
    PDF (2014 KB) | HTML

    A depth image-based rendering (DIBR) approach with advanced inpainting methods is presented. The DIBR algorithm can be used in 3-D video applications to synthesize a number of different perspectives of the same scene, e.g., from a multiview-video-plus-depth (MVD) representation. This MVD format consists of video and depth sequences for a limited number of original camera views of the same natural scene. Here, DIBR methods allow the computation of additional new views. An inherent problem of the view synthesis concept is that image information which is occluded in the original views may become visible, especially in extrapolated views beyond the viewing range of the original cameras. The presented algorithm synthesizes these occluded textures. The synthesizer achieves visually satisfying results by taking spatial and temporal consistency measures into account. Detailed experiments show significant objective and subjective gains of the proposed method in comparison to state-of-the-art methods.
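
    A toy forward-warping DIBR step is sketched below for a rectified, horizontally shifted virtual camera, with disparity f·baseline/depth; f and baseline are assumed camera parameters. The pixels that no source pixel lands on are the disoccluded holes the paper's texture synthesis must fill.

    ```python
    import numpy as np

    def dibr_warp(color, depth, f, baseline):
        """Forward-warp a view to a horizontally shifted virtual camera."""
        H, W, _ = color.shape
        out = np.zeros_like(color)
        filled = np.zeros((H, W), dtype=bool)
        disp = np.round(f * baseline / np.maximum(depth, 1e-6)).astype(int)
        # Visit pixels far-to-near so nearer pixels overwrite (z-ordering).
        order = np.argsort(-depth, axis=None)
        ys, xs = np.unravel_index(order, depth.shape)
        xt = xs - disp[ys, xs]
        ok = (xt >= 0) & (xt < W)
        out[ys[ok], xt[ok]] = color[ys[ok], xs[ok]]
        filled[ys[ok], xt[ok]] = True
        return out, ~filled      # the True entries mark holes to inpaint
    ```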

  • ConnectBoard: Enabling Genuine Eye Contact and Accurate Gaze in Remote Collaboration

    Page(s): 466 - 473
    PDF (1144 KB) | HTML

    Conventional telepresence systems allow remote users to see one another and interact with shared media, but users cannot make eye contact, and gaze awareness with respect to shared media and documents is lost. In this paper, we describe a remote collaboration system based on a see-through display that creates an experience where local and remote users are seemingly separated only by a vertical sheet of glass. Users can see each other and media displayed on the shared surface. Face detectors are applied on the local and remote video streams to introduce an offset in the video display so as to bring the local user's face, the local camera, and the remote user's face image into collinearity. This ensures that, when the local user looks at the remote user's image, the camera behind the see-through display captures an image with the "Mona Lisa effect," where the eyes of an image appear to follow the viewer. Experiments show that, for one-on-one meetings, our system is capable of capturing and delivering realistic, genuine eye contact as well as accurate gaze awareness with respect to shared media.
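
    In a simplified 2-D screen-plane form, the offset amounts to shifting the rendered remote video so the remote user's eyes sit at the screen point directly in front of the camera mounted behind the display; a local user looking into those eyes then also looks into the camera. The helper below is hypothetical and ignores the full two-sided geometry the paper handles.

    ```python
    def display_offset(camera_xy, remote_face_xy):
        """Pixel offset to apply to the remote video so the remote face
        lands in front of the behind-screen camera (toy 2-D version)."""
        return (camera_xy[0] - remote_face_xy[0],
                camera_xy[1] - remote_face_xy[1])
    ```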

  • A Multi-Gesture Interaction System Using a 3-D Iris Disk Model for Gaze Estimation and an Active Appearance Model for 3-D Hand Pointing

    Page(s): 474 - 486
    PDF (1694 KB)

    In this paper, we present a vision-based human-computer interaction system which integrates control components using multiple gestures, including eye gaze, head pose, hand pointing, and mouth motions. To track head, eye, and mouth movements, we present a two-camera system that detects the face from a fixed, wide-angle camera, estimates a rough location for the eye region using an eye detector based on topographic features, and directs another active pan-tilt-zoom camera to focus on this eye region. We also propose a novel eye gaze estimation approach for point-of-regard (POR) tracking on a viewing screen. To allow for greater head pose freedom, we developed a new calibration approach to find the 3-D eyeball location, eyeball radius, and fovea position. Moreover, in order to get the optical axis, we create a 3-D iris disk by mapping both the iris center and iris contour points to the eyeball sphere. We then rotate the fovea accordingly and compute the final visual-axis gaze direction. This part of the system permits natural, non-intrusive, pose-invariant POR estimation from a distance without resorting to infrared or complex hardware setups. We also propose and integrate a two-camera hand pointing estimation algorithm for hand gesture tracking in 3-D from a distance. The algorithms of gaze pointing and hand finger pointing are evaluated individually, and the feasibility of the entire system is validated through two interactive information visualization applications.
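
    Once the visual axis is known, the on-screen point of regard is a ray-plane intersection; the sketch below shows only that final geometric step (the paper's contribution lies in the calibration and the 3-D iris-disk construction that produce the axis).

    ```python
    import numpy as np

    def point_of_regard(eye_center, visual_axis, screen_point, screen_normal):
        """Intersect the gaze ray from the eyeball center with the screen
        plane; all arguments are 3-D NumPy vectors, visual_axis a unit
        direction."""
        denom = np.dot(visual_axis, screen_normal)
        if abs(denom) < 1e-9:
            return None                       # gaze parallel to the screen
        t = np.dot(screen_point - eye_center, screen_normal) / denom
        return eye_center + t * visual_axis
    ```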

  • A Touch Interface Exploiting Time-Frequency Classification Using Zak Transform for Source Localization on Solids

    Page(s): 487 - 497
    PDF (1231 KB) | HTML

    We propose a new approach to the development of a touch interface using surface-mounted sensors, which allows one to convert a hard surface into a touch pad. This is achieved by using location template matching (LTM), a source localization algorithm that is robust to dispersion and multipath. In this interdisciplinary research, we employ mechanical vibration theories that model wave propagation of the flexural modes of vibration generated by an impact on the surface. We then verify that the amplitude variance across time for each propagating mode frequency is unique to each location on a surface. We show that the Zak transform allows us to faithfully track these amplitude variations, and we exploit the uniqueness of this variance as a time-frequency classifier, which in turn allows us to localize a finger tap in the context of a human-computer interface. The performance of the proposed algorithm is compared with existing LTM approaches on real surfaces.
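
    For a discrete signal of length L·M, the Zak transform Z[k, n] = Σ_m x[n + mL]·exp(-2πjkm/M) reduces to an M-point DFT across the rows of the signal reshaped into M segments of length L. A minimal NumPy version follows; the function name and interface are ours, not the paper's.

    ```python
    import numpy as np

    def discrete_zak(x, L):
        """Discrete Zak transform of a length L*M signal.

        Returns Z of shape (M, L) with Z[k, n] = sum_m x[n + m*L] *
        exp(-2j*pi*k*m/M).  Tracking how |Z| varies over time at each
        mode frequency gives the location signature used for matching.
        """
        x = np.asarray(x)
        M = len(x) // L
        grid = x[: L * M].reshape(M, L)     # grid[m, n] = x[n + m*L]
        return np.fft.fft(grid, axis=0)
    ```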

  • Sensitivity Analysis of the Human Visual System for Depth Cues in Stereoscopic 3-D Displays

    Page(s): 498 - 506
    PDF (906 KB) | HTML

    Three-dimensional (3-D) displays provide a more realistic entertainment experience by giving viewers an added sensation of depth, artificially exploiting light rays to stimulate certain depth cues in the human visual system, especially binocular stereopsis. Due to this close relationship with human visual perception, mass-market deployment of 3-D displays will depend significantly on addressing related perceptual factors such as visual comfort. To address these factors, it is very important to understand how humans experience depth on 3-D displays and how sensitive they are to different depth cues. In this paper, the sensitivity of humans to different depth cues is analyzed as applicable to 3-D viewing on stereoscopic displays. Mathematical models are derived to explain the just noticeable difference in depth (JNDD) for three different depth cues, namely binocular disparity, retinal blur, and relative size. Extensive subjective assessments are performed on a stereoscopic display with passive polarized glasses and on an auto-stereoscopic display to validate the mathematical models for JNDD. The proposed models are expected to have important use cases in 3-D display design as well as 3-D content production.
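
    The geometry behind the binocular-disparity cue follows from similar triangles between the two eyes and the screen; the sketch below computes perceived depth for a given on-screen disparity and probes the depth change caused by a small disparity increment. This is standard stereoscopic viewing geometry, not the paper's fitted JNDD models, and all numbers are illustrative.

    ```python
    def perceived_depth(disparity_m, view_dist_m=2.0, eye_sep_m=0.065):
        """Depth behind (+) or in front of (-) the screen implied by an
        on-screen disparity, from similar triangles."""
        return view_dist_m * disparity_m / (eye_sep_m - disparity_m)

    # Depth change produced by a 0.5-mm disparity increment around a
    # 5-mm uncrossed disparity (a crude probe of depth sensitivity):
    dz = perceived_depth(0.0055) - perceived_depth(0.005)
    ```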

  • Fast Action Detection via Discriminative Random Forest Voting and Top-K Subvolume Search

    Page(s): 507 - 517
    PDF (2323 KB) | HTML

    Multiclass action detection in complex scenes is a challenging problem because of cluttered backgrounds and the large intra-class variations in each type of action. To achieve efficient and robust action detection, we characterize a video as a collection of spatio-temporal interest points and locate actions by finding the spatio-temporal video subvolumes with the highest mutual information score for each action class. A random forest is constructed to efficiently generate discriminative votes from individual interest points, and a fast top-K subvolume search algorithm is developed to find all action instances in a single round of search. Without significantly degrading the performance, such a top-K search can be performed on down-sampled score volumes for more efficient localization. Experiments on the challenging MSR Action Dataset II validate the effectiveness of our proposed multiclass action detection method. The detection speed is several orders of magnitude faster than that of existing methods.
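
    The primitive that makes top-K subvolume search fast is an O(1) subvolume score computed from a 3-D summed-volume table of the per-pixel random-forest votes (positive for the action class, negative otherwise); the branch-and-bound search itself is omitted from this sketch.

    ```python
    import numpy as np

    def integral_volume(score):
        """Zero-padded 3-D summed-volume table of a (X, Y, T) vote volume."""
        iv = score.cumsum(0).cumsum(1).cumsum(2)
        return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

    def subvolume_sum(iv, x0, x1, y0, y1, t0, t1):
        """Sum of score[x0:x1, y0:y1, t0:t1] by inclusion-exclusion."""
        return (iv[x1, y1, t1]
                - iv[x0, y1, t1] - iv[x1, y0, t1] - iv[x1, y1, t0]
                + iv[x0, y0, t1] + iv[x0, y1, t0] + iv[x1, y0, t0]
                - iv[x0, y0, t0])
    ```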

  • Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval

    Page(s): 518 - 529
    PDF (950 KB) | HTML

    Audio tags correspond to keywords that people use to describe different aspects of a music clip. With the explosive growth of digital music available on the Web, automatic audio tagging, which can be used to annotate unknown music or retrieve desirable music, is becoming increasingly important. This can be achieved by training a binary classifier for each tag based on labeled music data. Our method, which won the MIREX 2009 audio tagging competition, is one such method. However, since social tags are usually assigned by people with different levels of musical knowledge, they inevitably contain noisy information. By treating the tag counts as costs, we can model the audio tagging problem as a cost-sensitive classification problem. In addition, tag correlation information is useful for automatic audio tagging since some tags often co-occur. By considering the co-occurrences of tags, we can model the audio tagging problem as a multi-label classification problem. To exploit the tag count and correlation information jointly, we formulate the audio tagging task as a novel cost-sensitive multi-label (CSML) learning problem and propose two solutions to solve it. The experimental results demonstrate that the new approach outperforms our MIREX 2009 winning method.
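
    One way to read "tag counts as costs" is to weight each training example's loss by how many users applied the tag; the sketch below does this for one-vs-rest logistic regression. It is an invented simplification of the paper's CSML formulation and omits the tag-correlation part entirely.

    ```python
    import numpy as np

    def cost_weighted_logistic(X, Y, counts, lr=0.1, n_iters=200):
        """X: (n, d) features; Y: (n, n_tags) binary labels;
        counts: (n, n_tags) tag counts used as per-example costs."""
        n, d = X.shape
        W = np.zeros((d, Y.shape[1]))
        cost = 1.0 + counts          # more annotators => costlier mistakes
        for _ in range(n_iters):
            P = 1.0 / (1.0 + np.exp(-X @ W))
            W -= lr * (X.T @ (cost * (P - Y))) / n
        return W
    ```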

  • Effective Semantic Annotation by Image-to-Concept Distribution Model

    Page(s): 530 - 538
    PDF (1068 KB) | HTML

    Image annotation based on visual features has been a difficult problem due to the diverse associations that exist between visual features and human concepts. In this paper, we propose a novel approach called Annotation by Image-to-Concept Distribution Model (AICDM) for image annotation by discovering the associations between visual features and human concepts from the image-to-concept distribution. Through the proposed image-to-concept distribution model, visual features and concepts can be bridged to achieve high-quality image annotation. We propose to use "visual features", "models", and "visual genes", which play roles analogous to the biological chromosome, DNA, and gene, respectively. Based on the proposed models using entropy, tf-idf, rules, and SVM, the goal of high-quality image annotation can be achieved effectively. Our empirical evaluation results reveal that the AICDM method can effectively alleviate the problem of visual-to-concept diversity and achieve better annotation results than many existing state-of-the-art approaches in terms of precision and recall.
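
    As an example of one of the weighting models named (tf-idf), the sketch below scores how discriminative each visual word ("gene") is for each concept, given a concept-by-word count matrix; the variable layout is an assumption made for illustration.

    ```python
    import numpy as np

    def tfidf_visual_genes(counts):
        """counts[i, j]: occurrences of visual word j in images of concept i.
        High weight = frequent in this concept, rare across concepts."""
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = (counts > 0).sum(axis=0)            # concepts containing word j
        idf = np.log(counts.shape[0] / np.maximum(df, 1))
        return tf * idf
    ```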

  • Editing by Viewing: Automatic Home Video Summarization by Viewing Behavior Analysis

    Page(s): 539 - 550
    PDF (1488 KB) | HTML

    In this paper, we propose the Interest Meter (IM), a system that makes the computer aware of users' reactions in order to measure their interest and use it to conduct video summarization. The IM takes into account users' spontaneous reactions when they view videos. To estimate viewing interest, quantitative interest measures are devised from the perspectives of attention and emotion. For estimating attention states, variations of the user's eye movement, blink, and head motion are considered. For estimating emotion states, facial expression is recognized as positive or neutral emotion. By combining characteristics of attention and emotion with a fuzzy fusion scheme, we transform users' viewing behaviors into quantitative interest scores, determine interesting parts of videos, and finally concatenate them as video summaries. Experimental results show that the proposed concept of "editing by viewing" works well and may provide a promising direction for considering the human factor in video summarization.
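
    A fuzzy fusion of per-shot attention and emotion scores could look like the toy rule base below; the membership functions, rules, and defuzzification centers are invented for illustration and are not the paper's actual scheme.

    ```python
    def interest_score(attention, emotion):
        """Fuse attention and emotion scores (both in [0, 1]) into one
        interest score via a tiny max-min fuzzy rule base."""
        lo = lambda v: max(0.0, 1.0 - 2.0 * v)       # 'low' membership
        hi = lambda v: max(0.0, 2.0 * v - 1.0)       # 'high' membership
        r_high = min(hi(attention), hi(emotion))     # both high -> high
        r_mid = max(min(hi(attention), lo(emotion)),
                    min(lo(attention), hi(emotion)))
        r_low = min(lo(attention), lo(emotion))
        total = r_high + r_mid + r_low
        # Weighted-centroid defuzzification with centers 1.0 / 0.5 / 0.0.
        return 0.5 if total == 0 else (r_high + 0.5 * r_mid) / total
    ```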

  • Layer-Aware Forward Error Correction for Mobile Broadcast of Layered Media

    Page(s): 551 - 562
    PDF (1587 KB) | HTML

    The bitstream structure of layered media formats such as scalable video coding (SVC) or multiview video coding (MVC) opens up new opportunities for their distribution in Mobile TV services, enabling features like graceful degradation or backwards-compatible support of the 3-D experience. The reason is that some parts of the media stream are more important than others, with each part itself providing a useful media representation. Typically, the decoding of some parts of the bitstream is only possible if the corresponding more important parts are correctly received. Hence, unequal error protection (UEP) can be applied, protecting important parts of the bitstream more strongly than others. Mobile broadcast systems typically apply forward error correction (FEC) on upper layers to cope with transmission errors that the physical-layer FEC cannot correct. Today's FEC solutions are optimized to transmit single-layer video. The exploitation of the dependencies in layered media codecs for UEP using FEC is the subject of this paper. The presented scheme, called layer-aware FEC (LA-FEC), incorporates the dependencies of the layered video codec into the FEC code construction. A combinatorial analysis is derived to show the potential theoretical gain in terms of FEC decoding probability and video quality. Furthermore, the implementation of LA-FEC as an extension of the Raptor FEC and the related signaling are described. The performance of the layer-aware Raptor code with SVC is demonstrated by experimental results in a DVB-H environment, which show significant improvements achieved by LA-FEC.
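
    The flavor of the combinatorial analysis can be reproduced under an idealized MDS-like assumption: a layer decodes once at least k of its n symbols arrive (Raptor codes only approximate this). The sketch compares the base-layer decoding probability of conventional UEP against an LA-FEC-style construction in which enhancement-layer parity is generated across both layers and therefore also protects the base layer; all symbol counts and the loss rate are illustrative.

    ```python
    from math import comb

    def p_decode(k, n, p):
        """P[at least k of n symbols arrive], i.i.d. loss probability p."""
        return sum(comb(n, r) * (1 - p) ** r * p ** (n - r)
                   for r in range(k, n + 1))

    k0, n0 = 40, 60      # base layer: source / total symbols
    k1, n1 = 40, 50      # enhancement layer
    p = 0.25             # channel symbol loss rate
    # Conventional UEP: the base layer must decode from its own symbols.
    print("UEP    base layer:", p_decode(k0, n0, p))
    # LA-FEC: pooled symbols jointly recover base + enhancement sources.
    print("LA-FEC base layer:", p_decode(k0 + k1, n0 + n1, p))
    ```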

  • A Flash Translation Layer for NAND Flash-Based Multimedia Storage Devices

    Page(s): 563 - 572
    PDF (1069 KB) | HTML

    NAND flash memory-based storage devices are becoming an attractive storage solution for multimedia storage servers because they afford several advantages, including fast read access speed and low power consumption. Multimedia storage devices differ significantly from traditional storage devices in that user files are accessed in a sequential read manner while metadata are updated frequently. In this paper, we propose a new flash translation layer (FTL) scheme called filtering FTL (FFTL) for NAND flash-based multimedia storage devices. The main idea of the FFTL scheme is to filter update requests for metadata and manage them separately from requests for user data. Our experimental results show that the proposed FFTL scheme outperforms existing FTL schemes by dramatically reducing the garbage collection overhead.
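
    The filtering idea can be sketched as a write router that diverts small, frequently updated metadata writes into a separate log region so they stop fragmenting the large sequential user-data blocks, which is what cuts garbage-collection cost. Classifying by request size is an assumption of this sketch, not necessarily FFTL's actual policy.

    ```python
    def route_write(lba, n_sectors, meta_log, data_log, small_io_threshold=8):
        """Append a write to the metadata log if it is small (likely a
        metadata update), otherwise to the user-data log."""
        log = meta_log if n_sectors <= small_io_threshold else data_log
        log.append((lba, n_sectors))     # log-structured, out-of-place write
        return log
    ```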

  • High-Quality Visualization for Geographically Distributed 3-D Teleimmersive Applications

    Page(s): 573 - 584
    PDF (2038 KB)

    The growing popularity of 3-D movies has led to the rapid development of numerous affordable consumer 3-D displays. In contrast, the development of technology to generate 3-D content has lagged behind considerably. In spite of significant improvements to the quality of imaging devices, the accuracy of the algorithms that generate 3-D data, and the hardware available to render such data, the algorithms available to calibrate, reconstruct, and then visualize such data remain difficult to use, extremely noise sensitive, and unreasonably slow. In this paper, we present a multi-camera system that creates a highly accurate (on the order of a centimeter) 3-D reconstruction of an environment in real time (under 30 ms) and allows for remote interaction between users. This paper focuses on addressing the aforementioned deficiencies by describing algorithms to calibrate, reconstruct, and render objects in the system. We demonstrate the accuracy and speed of our results on a variety of benchmarks and data collected from our own system.

  • IEEE Transactions on Multimedia EDICS

    Page(s): 585
    PDF (16 KB)
    Freely Available from IEEE
  • IEEE Transactions on Multimedia Information for authors

    Page(s): 586 - 587
    PDF (46 KB)
    Freely Available from IEEE
  • Special issue on fundamental technologies in modern speech recognition

    Page(s): 588
    PDF (108 KB)
    Freely Available from IEEE
  • IEEE Transactions on Multimedia society information

    Page(s): C3
    PDF (28 KB)
    Freely Available from IEEE

Aims & Scope

The Periodical covers all aspects of research in multimedia technology and its applications.


Meet Our Editors

Editor-in-Chief
Chang Wen Chen
State University of New York at Buffalo