IEEE Transactions on Multimedia

Issue 7 • Nov. 2013

  • Table of contents

    Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Multimedia publication information

    Page(s): C2
    Freely Available from IEEE
  • Emotional Accompaniment Generation System Based on Harmonic Progression

    Page(s): 1469 - 1479

    In many genres, a music piece consists of a melody and its accompaniment. In this paper, we present a system that automatically generates accompaniment evoking specific emotions for a given melody. In particular, we propose harmonic progression and onset rate as two key features for emotion-based accompaniment generation. The former refers to the progression of chords, and the latter to the number of musical events (such as notes and drums) per unit time. The harmonic progression and the onset rate are altered according to the specified emotion, represented by the valence and arousal parameters, respectively. The performance of the system is evaluated subjectively, and the results show a perfect positive Spearman correlation between the specified emotion and the perceived emotion.
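
    As an aside, a minimal sketch (not the authors' code) of how such a subjective evaluation could be scored with a Spearman rank correlation; the valence values below are hypothetical:

    ```python
    # Illustrative only: Spearman rank correlation between specified and
    # perceived emotion ratings (made-up data, not from the paper).
    from scipy.stats import spearmanr

    specified_valence = [1, 2, 3, 4, 5]            # valence levels given to the system
    perceived_valence = [1.2, 2.1, 2.9, 4.3, 4.8]  # mean listener ratings (hypothetical)

    rho, p_value = spearmanr(specified_valence, perceived_valence)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # rho = 1.00 for monotone data
    ```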

  • Aesthetic Image Enhancement by Dependence-Aware Object Recomposition

    Page(s): 1480 - 1490

    This paper proposes an image-enhancement method to optimize photograph composition by rearranging foreground objects in the photograph. To adjust objects' positions while keeping the original scene content, we first perform a novel structure dependence analysis on the image to obtain the dependencies between all background regions. To determine the optimal positions for foreground objects, we formulate an optimization problem based on widely used heuristics for aesthetically pleasing pictures. Semantic relations between foreground objects are also taken into account during optimization. The final output is produced by moving foreground objects, together with their dependent regions, to optimal positions. The results show that our approach can effectively optimize photographs with single or multiple foreground objects without compromising the original photograph content.
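
    A toy sketch of one widely used aesthetic heuristic such an optimization might include (rule of thirds); the scoring function and scale factor are hypothetical, and the paper's actual formulation, structure dependencies, and semantic relations are not shown:

    ```python
    # Hypothetical rule-of-thirds term for scoring a candidate foreground-object
    # position; illustration only, not the paper's objective function.
    import numpy as np

    def thirds_score(obj_center, img_w, img_h):
        """Higher when the object center is near one of the four power points."""
        power_points = [(img_w * x, img_h * y) for x in (1/3, 2/3) for y in (1/3, 2/3)]
        cx, cy = obj_center
        d = min(np.hypot(cx - px, cy - py) for px, py in power_points)
        return np.exp(-d / (0.1 * max(img_w, img_h)))  # hypothetical distance penalty

    print(thirds_score((213, 160), 640, 480))  # object near the upper-left power point
    ```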

  • JPIP Proxy Server With Prefetching Strategies Based on User-Navigation Model and Semantic Map

    Page(s): 1491 - 1502

    The efficient transmission of high-resolution images, and in particular the interactive transmission of images in a client-server scenario, is important for many applications. Among current image compression standards, JPEG2000 excels for its interactive transmission capabilities. In general, three mechanisms are employed to optimize the transmission of images when using the JPEG2000 Interactive Protocol (JPIP): 1) packet re-sequencing at the server; 2) prefetching at the client; and 3) proxy servers along the network infrastructure. To avoid congesting the network, prefetching mechanisms are not commonly employed when many clients within a local area network (LAN) browse images from a remote server. To maximize the responsiveness of all clients within a LAN, this work proposes the use of prefetching strategies at the proxy server rather than at the clients. The main insight behind the proposed prefetching strategies is a user-navigation model and a semantic map that predict the future requests of the clients. Experimental results indicate that introducing these strategies into a JPIP proxy server notably enhances the browsing experience of end-users.
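
    A rough sketch of the general idea behind a user-navigation model for proxy-side prefetching; the tile identifiers, transition table, and prefetch count below are invented for illustration and are not the paper's model:

    ```python
    # Hypothetical first-order navigation model: count transitions between
    # requested image regions (tiles) and prefetch the most likely next tiles.
    from collections import defaultdict, Counter

    transitions = defaultdict(Counter)   # tile -> Counter of next tiles

    def record_request(prev_tile, tile):
        if prev_tile is not None:
            transitions[prev_tile][tile] += 1

    def tiles_to_prefetch(current_tile, k=2):
        """Return up to k most probable next tiles given past navigation."""
        return [t for t, _ in transitions[current_tile].most_common(k)]

    # Simulated client navigation history (made-up tile ids).
    history = ["A1", "A2", "A3", "A2", "A3", "B1", "A2", "A3"]
    for prev, cur in zip([None] + history[:-1], history):
        record_request(prev, cur)

    print(tiles_to_prefetch("A2"))   # e.g. ['A3'] -> fetch ahead of the client request
    ```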

  • Visually Favorable Tone-Mapping With High Compression Performance in Bit-Depth Scalable Video Coding

    Page(s): 1503 - 1518

    In bit-depth scalable video coding, the tone-mapping scheme used to convert high-bit-depth videos to eight-bit videos is an essential yet very often ignored component. In this paper, we demonstrate that an appropriate choice of tone-mapping operator can improve the coding efficiency of bit-depth scalable encoders. We present a new tone-mapping scheme that delivers superior compression efficiency while adhering to a predefined base layer perceptual quality. We develop numerical models that estimate the base layer bit-rate (Rb), the enhancement layer bit-rate (Re), and the mismatch (QL) between the resulting low dynamic range (LDR) base-layer signal and the predefined base layer representation. Our proposed tone curve is given by the solution of an optimization problem that minimizes a weighted sum of Rb, Re, and QL. The formulation also accounts for the temporal effect of tone-mapping by adding a constraint that suppresses flickering artifacts. We also propose a technique for tone-mapping a high-bit-depth video directly in a compression-friendly color space (e.g., one luma and two chroma channels) without converting to the RGB domain. Experimental results show that we can save up to 40% of the total bit-rate (or obtain a 3.5 dB PSNR improvement at the same bit-rate), and, in general, about 20% bit-rate savings can be achieved.
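
    A schematic sketch of the kind of weighted-sum objective described above, solved over tone-curve slopes. The rate and quality terms below are toy quadratic placeholders, not the paper's models, and the weights are arbitrary:

    ```python
    # Illustrative objective over tone-curve slopes s:
    #   J(s) = w_b*Rb(s) + w_e*Re(s) + w_q*QL(s) + w_t*flicker(s)
    # with non-negative slopes that sum to 1 (toy stand-in models throughout).
    import numpy as np
    from scipy.optimize import minimize

    n_bins = 16
    s_target = np.full(n_bins, 1.0 / n_bins)   # slopes of a predefined base-layer curve
    s_prev = s_target.copy()                   # slopes used for the previous frame
    w_b, w_e, w_q, w_t = 1.0, 1.0, 2.0, 5.0    # hypothetical weights

    def cost(s):
        Rb = np.sum(s ** 2)                    # placeholder base-layer rate model
        Re = np.sum((1.0 - n_bins * s) ** 2)   # placeholder enhancement-layer rate model
        QL = np.sum((s - s_target) ** 2)       # placeholder mismatch to target base layer
        flicker = np.sum((s - s_prev) ** 2)    # temporal constraint against flickering
        return w_b * Rb + w_e * Re + w_q * QL + w_t * flicker

    res = minimize(cost, s_target, bounds=[(0, None)] * n_bins,
                   constraints=[{"type": "eq", "fun": lambda s: s.sum() - 1.0}])
    print(res.x.round(3))                      # optimized tone-curve slopes
    ```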

  • Scalable Resource Allocation for SVC Video Streaming Over Multiuser MIMO-OFDM Networks

    Page(s): 1519 - 1531

    In this paper, we propose a scalable resource allocation framework for streaming scalable videos over multiuser multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) networks. We exploit the utilities of scalable videos produced by the scalable extension of H.264/AVC (SVC) and investigate the multidimensional diversities of multiuser MIMO-OFDM wireless networks. First, we study the rate-utility relationship of SVC via a packet prioritization scheme. Based on this rate-utility analysis, a scalable resource-allocation framework is proposed to achieve differentiated service objectives for different scalable video layers. To provide users with a fair opportunity to obtain a basic viewing experience, a fairness scheme is designed to guarantee MAXMIN fairness in the reception of each user's base layer video packets. After all users have their base layer packets successfully scheduled, the remaining resources are distributed to improve network efficiency. The two schemes are integrated into a unified bit loading and power allocation solution to enhance the practicality of the scalable framework. Experimental results confirm that the proposed scheme balances fairness and efficiency better than conventional schemes in different scenarios.
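
    A simplified sketch of the max-min idea for base-layer scheduling; the per-user rates and number of resource blocks are invented, and real MIMO-OFDM bit loading and power allocation are far more involved:

    ```python
    # Hypothetical greedy max-min allocation: repeatedly give the next resource
    # block to the user with the least accumulated base-layer rate.
    rate_per_block = {"user1": 1.5, "user2": 0.8, "user3": 1.1}  # Mb/s per block (made up)
    allocated_rate = {u: 0.0 for u in rate_per_block}
    n_blocks = 12

    for _ in range(n_blocks):
        worst = min(allocated_rate, key=allocated_rate.get)   # most starved user
        allocated_rate[worst] += rate_per_block[worst]

    print(allocated_rate)   # base-layer rates end up roughly balanced across users
    ```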

  • Simplification Resilient LDPC-Coded Sparse-QIM Watermarking for 3D-Meshes

    Page(s): 1532 - 1542

    We propose a blind watermarking scheme for 3D meshes that combines sparse quantization index modulation (QIM) with deletion correction codes. The QIM operates on the vertices in rough concave regions of the surface, thus ensuring imperceptibility, while the deletion correction code recovers the hidden data that is removed by mesh optimization and/or simplification. The proposed scheme offers two orders of magnitude better performance in terms of recovered watermark bit error rate compared to existing schemes with similar payloads and fidelity constraints.
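
    A bare-bones scalar QIM example (embedding a bit by choosing between two shifted quantizers); the quantization step is arbitrary, and the paper's vertex selection in rough concave regions and LDPC-coded deletion correction are not shown:

    ```python
    # Textbook scalar quantization index modulation (QIM), illustration only.
    import numpy as np

    DELTA = 0.1   # quantization step (hypothetical)

    def qim_embed(x, bit):
        """Quantize x on the lattice offset by bit*DELTA/2."""
        offset = bit * DELTA / 2.0
        return np.round((x - offset) / DELTA) * DELTA + offset

    def qim_detect(y):
        """Pick the lattice (bit) whose reconstruction is closest to y."""
        d0 = abs(qim_embed(y, 0) - y)
        d1 = abs(qim_embed(y, 1) - y)
        return 0 if d0 <= d1 else 1

    x = 0.4237                      # e.g. a vertex coordinate along its normal
    y = qim_embed(x, 1)
    print(y, qim_detect(y))         # detected bit is 1; |y - x| <= DELTA/2
    ```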

  • Face Expression Recognition by Cross Modal Data Association

    Page(s): 1543 - 1552

    We present a novel facial expression recognition framework using audio-visual information analysis. We propose to model the cross-modality data correlation while allowing the modalities to be treated as asynchronous streams. We also show that, by incorporating auditory information, our framework can improve recognition performance while significantly reducing the computational cost, since redundant or insignificant frames are not processed. In particular, we design a single representative image for an image sequence as a weighted sum of registered face images, where the weights are derived from auditory features. We then use a still-image-based technique for the expression recognition task. Our framework, however, can be generalized to work with dynamic features as well. We performed experiments on the eNTERFACE'05 audio-visual emotion database, which contains six archetypal emotion classes: Happy, Sad, Surprise, Fear, Anger, and Disgust. We present one-to-one binary classification as well as multi-class classification performance, evaluated using both subject-dependent and subject-independent strategies. Furthermore, we compare our multi-class classification accuracies with those reported in previously published work on the same database. Our analyses show promising results.
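
    A minimal sketch of building one representative image as an audio-weighted sum of registered face frames; the frames and audio energies below are synthetic placeholders, not the authors' feature pipeline or the eNTERFACE'05 data:

    ```python
    # Illustrative audio-weighted frame aggregation (synthetic data).
    import numpy as np

    rng = np.random.default_rng(0)
    frames = rng.random((10, 64, 64))          # 10 registered face frames (toy)
    audio_energy = rng.random(10)              # per-frame auditory feature (toy)

    weights = audio_energy / audio_energy.sum()             # normalize to sum to 1
    representative = np.tensordot(weights, frames, axes=1)  # weighted sum of frames

    print(representative.shape)   # (64, 64): one image for a still-image classifier
    ```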

  • Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention

    Page(s): 1553 - 1568

    Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. In this work, the detection of attention-invoking audiovisual segments is formulated on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging of the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve that models the time-varying perceptual importance of the composite video stream and signifies prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
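
    A toy sketch of fusing per-modality saliency curves and picking summary segments; the curves, fusion weights, and segment length are fabricated for illustration and do not correspond to the fusion schemes evaluated in the paper:

    ```python
    # Illustrative linear fusion of aural/visual/textual saliency curves and
    # selection of the top-scoring segments for a summary (synthetic values).
    import numpy as np

    rng = np.random.default_rng(1)
    n_frames = 300
    aural, visual, textual = rng.random((3, n_frames))   # per-frame saliency in [0, 1]

    w = np.array([0.4, 0.4, 0.2])                        # hypothetical fusion weights
    multimodal = w[0] * aural + w[1] * visual + w[2] * textual

    segment_len = 30                                     # frames per candidate segment
    segments = multimodal[: n_frames // segment_len * segment_len].reshape(-1, segment_len)
    scores = segments.mean(axis=1)
    summary = np.argsort(scores)[::-1][:3]               # 3 most salient segments
    print(sorted(summary.tolist()))
    ```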

  • Learning to Produce 3D Media From a Captured 2D Video

    Page(s): 1569 - 1578

    Due to advances in display technologies and the commercial success of 3D motion pictures in recent years, there is renewed interest in enabling consumers to create 3D content. While new 3D content can be created using more advanced capture devices (e.g., stereo cameras), most people still own 2D capture devices, and enormously large collections of captured media exist only in 2D. We present a system for producing pseudo-stereo images from captured 2D videos. Our system employs a two-phase procedure. The first phase detects "good" pseudo-stereo image frames from a 2D video that was captured a priori without any constraints on camera motion or content. We use a trained classifier to detect pairs of video frames that are suitable for constructing pseudo-stereo images; in particular, for a given frame It at time t, we determine whether an offset t̂ exists such that It+t̂ and It can form an acceptable pseudo-stereo image. Moreover, even when t̂ is determined, generating a good pseudo-stereo image from 2D captured video frames can be nontrivial, since in many videos, professional or amateur, both foreground and background objects may undergo complex motion. Independent foreground motions from different scene objects define different epipolar geometries, which cause the conventional method of generating pseudo-stereo images to fail. To address this problem, the second phase of the proposed system recomposes the frame pairs to ensure consistent 3D perception of objects in such cases: final left and right pseudo-stereo images are created by recompositing different regions of the initial frame pairs to ensure a consistent camera geometry. We verify the performance of our method for producing pseudo-stereo media from captured 2D videos in a psychovisual evaluation using both professional movie clips and amateur home videos.

  • Energy and Quality-Aware Multimedia Signal Processing

    Page(s): 1579 - 1593

    This paper presents techniques to reduce energy with minimal degradation in system performance for multimedia signal processing algorithms. It first provides a survey of energy-saving techniques such as those based on voltage scaling, reducing the number of computations, and reducing the dynamic range. While these techniques reduce energy, they also introduce errors that affect performance quality. To compensate for these errors, techniques that exploit algorithm characteristics are presented. Next, several hybrid energy-saving techniques that further reduce energy consumption with low performance degradation are presented. For instance, a combination of voltage scaling and dynamic range reduction is shown to achieve 85% energy savings in a low-pass FIR filter at a fairly low noise level. A combination of computation reduction and dynamic range reduction for the Discrete Cosine Transform shows, on average, 33% to 46% reduction in energy consumption while incurring 0.5 dB to 1.5 dB loss in PSNR. Both of these techniques have very little overhead and achieve significant energy reduction with little quality degradation.
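
    A small numerical sketch of one of the surveyed ideas (dynamic-range reduction by quantizing the inputs of an FIR filter) and the quality cost it incurs; the filter, signal, and bit widths are arbitrary and unrelated to the paper's designs or energy figures:

    ```python
    # Illustrative dynamic-range reduction: quantize FIR filter inputs to fewer
    # bits and measure the resulting SNR loss (toy setup).
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=4096)
    h = np.ones(16) / 16.0                       # simple low-pass FIR filter

    def quantize(sig, bits):
        scale = 2 ** (bits - 1)
        return np.clip(np.round(sig * scale), -scale, scale - 1) / scale

    x_norm = x / np.abs(x).max()
    ref = np.convolve(x_norm, h, mode="same")
    for bits in (12, 8, 6):
        out = np.convolve(quantize(x_norm, bits), h, mode="same")
        snr = 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - out) ** 2))
        print(f"{bits}-bit inputs: SNR = {snr:.1f} dB")   # fewer bits -> lower SNR
    ```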

  • Style Transfer Via Image Component Analysis

    Page(s): 1594 - 1601

    Example-based stylization provides an easy way of creating artistic effects for images and videos. However, most existing methods do not consider content and style separately. In this paper, we propose a style transfer algorithm via a novel component analysis approach based on various image processing techniques. First, inspired by the steps of drawing a picture, an image is decomposed into three components: draft, paint, and edge, which describe the content, the main style, and the strengthened strokes along the boundaries, respectively. Then the style is transferred from the template image to the source image in the paint and edge components. Style transfer is formulated as a global optimization problem using Markov random fields, and a coarse-to-fine belief propagation algorithm is used to solve it. The final artistic result is obtained via a reconstruction step that combines the draft component with the obtained style information. Compared to other algorithms, our method not only synthesizes the style but also preserves the image content well. We also extend our algorithm from single-image stylization to video personalization by maintaining temporal coherence and identifying faces in video sequences. The results indicate that our approach performs excellently in stylization and personalization for images and videos.

  • Tracking Human Under Occlusion Based on Adaptive Multiple Kernels With Projected Gradients

    Page(s): 1602 - 1615

    Kernel-based trackers have proven to be a promising approach to video object tracking. The use of a single kernel often suffers from occlusion, since the available visual information is not sufficient for kernel usage. To provide more robust tracking performance, multiple inter-related kernels have therefore been utilized for tracking in complicated scenarios. This paper presents an innovative method that uses projected gradients to help multiple kernels find the best match during tracking under predefined constraints. Adaptive weights are applied to the kernels to efficiently compensate for the adverse effects introduced by occlusion. An effective scheme is also incorporated to deal with scale changes during object tracking. Moreover, we embed the multiple-kernel tracking into a Kalman filtering-based tracking system to enable fully automatic tracking. Several simulations have been conducted to show the robustness of the proposed multiple-kernel tracking and to demonstrate that the overall system can successfully track video objects under occlusion.
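
    For context, a compact constant-velocity Kalman filter of the kind that could back such a tracking system; the noise parameters and measurements are arbitrary, and the paper's adaptive multiple-kernel matching is not shown:

    ```python
    # Minimal 2D constant-velocity Kalman filter (illustration only).
    import numpy as np

    dt = 1.0
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    Q = np.eye(4) * 0.01          # process noise (hypothetical)
    R = np.eye(2) * 1.0           # measurement noise (hypothetical)

    x = np.array([0.0, 0.0, 1.0, 0.5])   # state: [px, py, vx, vy]
    P = np.eye(4)

    for z in ([1.1, 0.4], [2.0, 1.1], [2.9, 1.6]):   # noisy kernel-based positions
        x = F @ x                                    # predict
        P = F @ P @ F.T + Q
        y = np.asarray(z) - H @ x                    # update with the measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
        print(x[:2].round(2))                        # filtered object position
    ```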

  • Preserving Motion-Tolerant Contextual Visual Saliency for Video Resizing

    Page(s): 1616 - 1627

    State-of-the-art video resizing methods usually produce perceivable visual discontinuities, especially in videos containing significant motion. To resolve this problem, contextual information about the focus of interest in consecutive video frames should be considered in order to preserve visual continuity. In this paper, to detect the focus of interest with motion tolerance, we propose a novel approach for modelling visual dynamics based on spatiotemporal slices (STS), which provide rich visual patterns over a large temporal scale. First, patch-based visual patterns are computed to generate a codebook over a spatiotemporal extent that is automatically specified by the contextual information in the STS. The codebook is then used to compute its associated response in each video frame, and eventually an importance map covering the focus of interest in a video clip is obtained. To preserve the visual continuity of the content, particularly of important areas, a multi-cue approach is used to guide a mesh-based non-homogeneous warping operation constrained by the trajectories in the STS. For the performance evaluation, we present a novel measure that uses patch-based Kullback-Leibler divergence (KL-divergence) to gauge the deformation of the focus of interest under the proposed video resizing approach. Experimental results show that the STS-based approach can generate retargeted videos effectively while maintaining isotropic manipulation and the continuous dynamics of visual perception.
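
    A small sketch of a patch-level KL-divergence comparison in the spirit of the evaluation measure described above; the patches, histogram binning, and noise level are toy choices, not the paper's protocol:

    ```python
    # Illustrative patch-based KL divergence between an original and a resized
    # (deformed) patch, using intensity histograms (synthetic data).
    import numpy as np

    def kl_divergence(p, q, eps=1e-8):
        p = p / p.sum() + eps
        q = q / q.sum() + eps
        return float(np.sum(p * np.log(p / q)))

    rng = np.random.default_rng(3)
    patch_orig = rng.random((16, 16))
    patch_resized = np.clip(patch_orig + rng.normal(scale=0.05, size=(16, 16)), 0, 1)

    hist_o, _ = np.histogram(patch_orig, bins=32, range=(0, 1))
    hist_r, _ = np.histogram(patch_resized, bins=32, range=(0, 1))
    print(kl_divergence(hist_o.astype(float), hist_r.astype(float)))  # larger => more deformation
    ```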

  • Multimedia Event Detection Using A Classifier-Specific Intermediate Representation

    Page(s): 1628 - 1637

    Multimedia event detection (MED) plays an important role in many applications such as video indexing and retrieval. Current event detection work mainly focuses on sports and news event detection or abnormality detection in surveillance videos. In contrast, our research aims to detect more complicated and generic events within longer video sequences. In the past, researchers have proposed using intermediate concept classifiers with concept lexica to help understand videos, yet it is difficult to judge how many and which concepts would be sufficient for a particular video analysis task. Additionally, obtaining robust semantic concept classifiers requires a large number of positive training examples, which in turn incurs a high human annotation cost. In this paper, we propose an approach that exploits external concept-based videos and event-based videos simultaneously to learn an intermediate representation from video features. Our algorithm integrates the classifier inference and the latent intermediate representation into a joint framework. The joint optimization of the intermediate representation and the classifier makes them mutually beneficial and reciprocal; effectively, the intermediate representation and the classifier are tightly correlated. The classifier-dependent intermediate representation not only accurately reflects the task semantics but is also better suited to the specific classifier. We have thus created a discriminative semantic analysis framework based on a tightly coupled intermediate representation. Extensive experiments on multimedia event detection using real-world videos demonstrate the effectiveness of the proposed approach.

  • Proxy-Based Multi-Stream Scalable Video Adaptation Over Wireless Networks Using Subjective Quality and Rate Models

    Page(s): 1638 - 1652

    Despite the growing maturity of broadband mobile networks, wireless video streaming remains a challenging task, especially in highly dynamic environments. Rapidly changing wireless link qualities, highly variable round-trip delays, and unpredictable traffic contention patterns often hamper the performance of conventional end-to-end rate adaptation techniques such as TCP-friendly rate control (TFRC). Furthermore, existing approaches tend to treat all flows leaving the network edge equally, without accounting for heterogeneity in the underlying wireless link qualities or the different rate utilities of the video streams. In this paper, we present a proxy-based solution for adapting scalable video streams at the edge of a wireless network, which can respond quickly to highly dynamic wireless links. Our design adopts the recently standardized scalable video coding (SVC) technique for lightweight rate adaptation at the edge. Leveraging previously developed rate and quality models of scalable video with both temporal and amplitude scalability, we derive a rate-quality model that gives the maximum quality achievable at a given rate by choosing the optimal frame rate and quantization step size. The proxy iteratively allocates rates to the different video streams to maximize a weighted sum of their video qualities, based on the periodically observed link throughputs and the sending buffer status. The temporal and amplitude layers included in each video are determined to optimize quality while satisfying the rate assignment. Simulation studies show that our scheme consistently outperforms TFRC in its agility to track link qualities and in the overall subjective quality of all streams. In addition, the proposed scheme supports differentiated services for different streams and competes fairly with TCP flows.

  • A Robust and Scalable Visual Category and Action Recognition System Using Kernel Discriminant Analysis With Spectral Regression

    Page(s): 1653 - 1664

    Visual concept detection and action recognition are among the most important tasks in content-based multimedia information retrieval (CBMIR). They aim at annotating images using a vocabulary defined by a set of concepts of interest, including scene types (mountains, snow, etc.) or human actions (phoning, playing an instrument). This paper describes our system in the ImageCLEF@ICPR10, Pascal VOC 08 Visual Concept Detection, and Pascal VOC 10 Action Recognition challenges. The proposed system ranked first in these large-scale tasks when evaluated independently by the organizers. The system involves state-of-the-art local descriptor computation, vector quantization via clustering, structured scene or object representation via localized histograms of vector codes, similarity measures for kernel construction, and classifier learning. The main novelty is the classifier-level and kernel-level fusion using Kernel Discriminant Analysis with Spectral Regression (SR-KDA) and RBF chi-squared kernels obtained from various image descriptors. The distinctiveness of the proposed method is also assessed experimentally on a video benchmark, the Mediamill Challenge, along with benchmarks from ImageCLEF@ICPR10, Pascal VOC 10, and Pascal VOC 08. The experimental results show that the presented system consistently yields significant performance gains compared with state-of-the-art methods. Another strong point is the introduction of SR-KDA in the classification stage, where the time complexity scales linearly with the number of concepts and the main computational cost is independent of the number of categories.
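
    For reference, a standard exponential chi-squared kernel between bag-of-words histograms, as commonly used in pipelines of this kind; the feature values and bandwidth below are placeholders, and the SR-KDA fusion itself is not shown:

    ```python
    # Standard RBF chi-squared kernel between visual-word histograms (illustration).
    import numpy as np

    def chi2_rbf_kernel(X, Y, gamma=1.0, eps=1e-10):
        """K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
        K = np.zeros((len(X), len(Y)))
        for i, x in enumerate(X):
            for j, y in enumerate(Y):
                K[i, j] = np.exp(-gamma * np.sum((x - y) ** 2 / (x + y + eps)))
        return K

    rng = np.random.default_rng(4)
    X = rng.random((3, 100)); X /= X.sum(axis=1, keepdims=True)   # 3 histograms
    Y = rng.random((2, 100)); Y /= Y.sum(axis=1, keepdims=True)   # 2 histograms
    print(chi2_rbf_kernel(X, Y, gamma=0.5).round(3))
    ```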

  • Interaction Design for Mobile Visual Search

    Page(s): 1665 - 1676

    Mobile devices are becoming ubiquitous. People take pictures with their phone cameras to explore the world on the go, and in many cases they are interested in picture-related information. Understanding the user intent conveyed by those pictures therefore becomes important. Existing mobile applications employ visual search to connect the captured picture with the physical world, but they achieve only limited success due to the ambiguous nature of user intent in a picture: one picture usually contains multiple objects. By taking advantage of multitouch interactions on mobile devices, this paper presents a prototype of interactive mobile visual search, named TapTell, that helps users formulate their visual intent more conveniently. This kind of search leverages limited yet natural user interactions on the phone to achieve more effective visual search while maintaining a satisfying user experience. We make three contributions in this work. First, we conduct a focus study on usage patterns and factors of concern for mobile visual search, which in turn leads to an interaction design for expressing visual intent by gesture. Second, we introduce four modes of gesture-based interaction (crop, line, lasso, and tap) and develop a mobile prototype. Third, we perform an in-depth usability evaluation of these modes, which demonstrates the advantage of such interactions and shows that lasso is the most natural and effective interaction mode. We show that TapTell provides a natural user experience for exploring the world with the phone camera and gestures. Based on these observations and conclusions, we also suggest design principles for future interactive mobile visual search.

  • Context-Aware Video Retargeting via Graph Model

    Page(s): 1677 - 1687

    Video retargeting is a crowded but challenging research area. To make the viewing experience as comfortable as possible, the most challenging issue is how to retain the spatial shape of important objects while ensuring temporal smoothness and coherence. Existing retargeting techniques deal with these spatial-temporal requirements individually, preserving the spatial geometry and temporal coherence of each region separately. However, the spatial-temporal properties of video content should be context-relevant, i.e., regions belonging to the same object should undergo a uniform spatial-temporal transformation. By ignoring contextual information, the divide-and-rule strategy of existing techniques usually incurs various spatial-temporal artifacts. To achieve spatial-temporally coherent video retargeting, this paper proposes a novel context-aware solution based on a graph model. First, we employ a grid-based warping framework to preserve the spatial structure and temporal motion trend at the level of grid cells. Second, we propose a graph-based motion layer partition algorithm that estimates the motions of different regions and simultaneously evaluates the contextual relationships between grid cells. Third, complementing the saliency-based spatial-temporal information preservation, two novel context constraints are encoded to encourage the grid cells of the same object to undergo uniform spatial and temporal transformations, respectively. Finally, we formulate the objective function as a quadratic programming problem. Our method achieves satisfactory spatial-temporal coherence while maximally avoiding the influence of artifacts. In addition, the grid-cell-wise motion estimation can be performed every few frames, which noticeably improves the speed. Experimental results and comparisons with state-of-the-art methods demonstrate the effectiveness and efficiency of our approach.

  • On-Device Mobile Visual Location Recognition by Integrating Vision and Inertial Sensors

    Page(s): 1688 - 1699

    This paper deals with the problem of city-scale on-device mobile visual location recognition by fusing inertial sensors and computer vision techniques. The main contributions are as follows. First, we design an efficient vector quantization strategy by combining Transform Coding (TC) and Residual Vector Quantization (RVQ). Our method can compress a visual descriptor into only a few bytes while providing reasonable search accuracy, which makes it feasible to manage a city-scale image database directly on a mobile device. Second, we integrate the information from inertial sensors into the Vector of Locally Aggregated Descriptors (VLAD) generation and image similarity evaluation processes. Our method is not only fast enough for on-device implementation but also noticeably improves location recognition accuracy. Third, we release a set of 1.295 million geo-tagged street-view images with accompanying inertial sensor information, as well as a difficult set of query images. These resources can be used as a new benchmark to facilitate further research in the area. Experimental results demonstrate the validity of the proposed methods for on-device mobile visual location recognition applications.
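
    A compact sketch of plain VLAD aggregation for orientation; the cluster centers and descriptors are synthetic, and the paper's TC/RVQ compression and inertial-sensor integration are omitted:

    ```python
    # Plain VLAD: aggregate residuals of local descriptors to their nearest
    # cluster center, then L2-normalize (toy data).
    import numpy as np

    rng = np.random.default_rng(5)
    centers = rng.random((8, 32))        # k=8 visual words, 32-D descriptors (toy)
    descs = rng.random((200, 32))        # local descriptors of one image (toy)

    assign = np.argmin(((descs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    vlad = np.zeros_like(centers)
    for k in range(len(centers)):
        if np.any(assign == k):
            vlad[k] = (descs[assign == k] - centers[k]).sum(axis=0)

    vlad = vlad.ravel()
    vlad /= np.linalg.norm(vlad) + 1e-12   # final VLAD vector for similarity search
    print(vlad.shape)                      # (256,)
    ```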

  • Reduced-Reference Image Quality Assessment With Visual Information Fidelity

    Page(s): 1700 - 1705

    Reduced-reference (RR) image quality assessment (IQA) aims to use less data about the reference image while achieving higher evaluation accuracy. Recent research on brain theory suggests that the human visual system (HVS) actively predicts the primary visual information and tries to avoid the residual uncertainty in image perception and understanding. Therefore, perceptual quality depends on the information fidelities of both the primary visual information and the residual uncertainty. In this paper, we propose a novel RR IQA index based on visual information fidelity. We argue that distortions of the primary visual information mainly disturb image understanding, while distortions of the residual uncertainty mainly change the comfort of perception. We separately compute the quantities of the primary visual information and the residual uncertainty of an image, and the fidelities of the two types of information are then evaluated separately for quality assessment. Experimental results demonstrate that the proposed index uses very little data (30 bits) and achieves high consistency with human perception.

  • Just Noticeable Difference Estimation for Images With Free-Energy Principle

    Page(s): 1705 - 1710

    In this paper, we introduce a novel just noticeable difference (JND) estimation model based on a unified brain theory, namely the free-energy principle. Existing pixel-based JND models mainly consider orderly factors and consistently underestimate the JND threshold of disorderly regions. Recent research indicates that the human visual system (HVS) actively predicts the orderly information and avoids the residual disorderly uncertainty in image perception and understanding. Thus, we suggest that there exists a disorderly concealment effect that results in a high JND threshold in disorderly regions. Starting from Bayesian inference, we derive an autoregressive model to imitate the active prediction of the HVS, and then estimate the disorderly concealment effect for the novel JND model. Experimental results confirm that the proposed JND model outperforms relevant existing ones. Furthermore, we apply the proposed JND model to image compression, and around 15% of the bit rate can be saved without jeopardizing perceptual quality.
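
    A rough sketch of the prediction-residual idea: predict each pixel from its neighbors and treat large residuals as the "disorderly" part. A simple averaging predictor stands in for the autoregressive model derived in the paper, and the image is synthetic:

    ```python
    # Illustrative disorder map: residual between an image and a crude local
    # predictor (a neighborhood average substitutes for the paper's AR model).
    import numpy as np
    from scipy.ndimage import uniform_filter

    rng = np.random.default_rng(6)
    img = rng.random((64, 64))                  # toy image

    predicted = uniform_filter(img, size=3)     # local prediction of each pixel
    residual = np.abs(img - predicted)          # large residual => disorderly region
    print(residual.mean().round(3))             # such regions would get a higher JND
                                                # threshold under this view
    ```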

  • TMM EDICS

    Page(s): 1711
    Freely Available from IEEE
  • IEEE Transactions on Multimedia information for authors

    Page(s): 1712 - 1713
    Freely Available from IEEE

Aims & Scope

The scope of the Periodical covers the various aspects of research in multimedia technology and applications of multimedia.


Meet Our Editors

Editor-in-Chief
Chang Wen Chen
State University of New York at Buffalo