
IEEE Transactions on Multimedia

Issue 2 • April 2012

  • Table of contents

    Page(s): C1 - C4
  • IEEE Transactions on Multimedia publication information

    Page(s): C2
  • An Advanced Hierarchical Motion Estimation Scheme With Lossless Frame Recompression and Early-Level Termination for Beyond High-Definition Video Coding

    Page(s): 237 - 249

In this paper, we present a hardware-efficient fast algorithm with a lossless frame recompression scheme and an early-level termination strategy for large-search-range (SR) motion estimation (ME) in beyond-high-definition video encoders. To achieve high ME quality in hierarchical motion search, we propose an advanced hierarchical ME scheme that processes the multiresolution motion search with an efficient refining stage. This enables high data and hardware reuse for much lower bandwidth and memory cost, while achieving higher ME quality than previous works. In addition, a lossless frame recompression scheme based on this ME algorithm is presented to further reduce bandwidth. A hierarchical memory organization, together with a leveling two-step data fetching strategy, is applied to meet the random-access constraint of the hierarchical motion search structure. A leveling compression strategy, which allows a lower level to refer to a higher one for compression, is also proposed to efficiently reduce bandwidth. Furthermore, an early-level termination method suited to the hierarchical ME structure is applied: it terminates redundant high-level motion searches using thresholds based on the current block mode and motion search level, and it applies early refinement termination to avoid unnecessary refinement at high levels. Experimental results show that the total scheme incurs a much smaller bit-rate increase than previous works, especially for high-motion sequences, while achieving considerable savings in memory and bandwidth cost for a large SR of [-128, 127].

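The early-level termination idea in this abstract can be sketched in a few lines: the search proceeds from the coarsest pyramid level downward and stops as soon as the best matching cost at the current level falls below a per-level threshold. This is an illustrative Python sketch under invented block shapes and threshold values, not the authors' hardware algorithm.

```python
# Illustrative sketch of early-level termination in a hierarchical motion
# search (not the authors' implementation; data layout and thresholds are
# invented for the example).

def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def hierarchical_search(levels, thresholds):
    """levels: list of (current_block, candidate_blocks), coarse to fine.
    thresholds: hypothetical per-level early-termination thresholds.
    Returns (best_cost, index of the level where the search stopped)."""
    best = float("inf")
    for i, (cur, candidates) in enumerate(levels):
        best = min(best, min(sad(cur, c) for c in candidates))
        if best <= thresholds[i]:  # early-level termination
            return best, i
    return best, len(levels) - 1
```

With an exact match at the coarsest level, the finer levels are never searched.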
  • Low-Decoding-Latency Buffer Compression for Graphics Processing Units

    Page(s): 250 - 263

Power consumption is the key design factor for graphics processing units (GPUs), especially for mobile applications. The increasing bandwidth required to produce more realistic graphics is a major power draw. To address this factor, in this paper we present a new universal buffer compression method that can handle both color and depth data with the same hardware unit. In contrast to current state-of-the-art technologies, which mainly pursue ever-higher compression ratios while disregarding decompression latency, our method reaches a good compromise between the two, both of which are critical to system performance. With spatial prediction and bitstream rearrangement, the data dependencies between different samples are reduced, which enables a parallel decoding process and gives the proposed system 6.78 times lower decoding latency. Moreover, by adopting a concept similar to the color/depth compression in the DXT5 texture compression method, better quality in terms of PSNR can be achieved without introducing any decoding latency when retrieving a texel.

  • Efficient and Rate-Distortion Optimal Wavelet Packet Basis Selection in JPEG2000

    Page(s): 264 - 277

This paper discusses optimal wavelet packet basis selection within JPEG2000. Algorithms for rate-distortion optimal wavelet packet basis selection in JPEG2000 are presented and compared to more efficient wavelet packet basis selection schemes. Both isotropic and anisotropic wavelet packet bases are considered. For the first time, computationally efficient heuristics are compared to the best bases in the standardized coding framework of JPEG2000, which allows the maximum performance gains of custom wavelet packets in JPEG2000 to be assessed. The algorithms are evaluated on a wide range of highly textured image data.

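Best-basis selection of the kind discussed here is typically a recursive cost comparison: keep a subband as a leaf, or split it and keep the children if they are cheaper. A minimal sketch follows, using an unnormalized Haar split and a significant-coefficient count as a crude rate proxy; this is not the paper's actual rate-distortion criterion.

```python
# Minimal best-basis sketch: at each node, compare the cost of keeping the
# signal as-is against the total cost after one unnormalized Haar split,
# recursing on the cheaper option. Cost = number of significant
# coefficients (a crude rate proxy invented for this illustration).

def haar_split(x):
    avg = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    diff = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return avg, diff

def cost(x, eps=1e-9):
    return sum(1 for c in x if abs(c) > eps)

def best_basis(x, depth):
    """Returns (cost, tree) where tree is 'leaf' or ('split', left, right)."""
    if depth == 0 or len(x) < 2:
        return cost(x), "leaf"
    avg, diff = haar_split(x)
    ca, ta = best_basis(avg, depth - 1)
    cd, td = best_basis(diff, depth - 1)
    if ca + cd < cost(x):
        return ca + cd, ("split", ta, td)
    return cost(x), "leaf"
```

On a constant signal the transform concentrates all energy into a single coefficient, so the recursion prefers splitting.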
  • Robust Image Coding Based Upon Compressive Sensing

    Page(s): 278 - 290

Multiple description coding (MDC) is one of the widely used mechanisms to combat packet loss in non-feedback systems. However, the number of descriptions in existing MDC schemes is very small (typically two), and as the number of descriptions increases, the coding complexity grows drastically and many decoders are required. In this paper, the principles of compressive sensing (CS) are studied and an alternative CS-based coding paradigm with a large number of descriptions is proposed for transmission with high packet loss. A two-dimensional discrete wavelet transform (DWT) is applied for sparse representation. Unlike typical wavelet coders (e.g., JPEG 2000), the DWT coefficients here are not encoded directly, but re-sampled toward equal importance of information instead. At the decoder side, by fully exploiting the intra-scale and inter-scale correlation of the multiscale DWT, two different CS recovery algorithms are developed for the low-frequency subband and the high-frequency subbands, respectively. The recovery quality depends only on the number of received CS measurements, not on which measurements are received. Experimental results show that the proposed CS-based codec is much more robust against lossy channels, while achieving higher rate-distortion (R-D) performance than conventional wavelet-based MDC methods and relevant existing CS-based coding schemes.

  • Efficient Genre-Specific Semantic Video Indexing

    Page(s): 291 - 302

Large video collections such as YouTube contain many different video genres, while in many applications the user may be interested in only one or two specific genres. Thus, when users query the system with a specific semantic concept like AnchorPerson or MovieStars, they are likely aiming at a genre-specific instantiation of this concept. Existing methods treat this problem as a classical learning problem, leading to unnecessarily complex models. We propose a framework to detect visual-based genre-specific concepts in a more efficient and accurate way, using a two-step framework that distinguishes two different levels. Genre-specific concept models are trained on a set with data labeled at video level for genres and at shot level for semantic concepts. In the classification stage, video genre classification is applied first to reduce the entire data set to a relatively small subset; the genre-specific concept models are then applied to this subset only. Experiments have been conducted on a small 28-h data set for genre-specific concept detection and a 4168-h (80 031 videos) benchmark data set for genre-specific topic search. Experimental results show that our proposed two-step method is more efficient and effective, for both the indexing and the search tasks, than existing methods which do not consider the different semantic levels between video genres and semantic concepts. When filtering out 80% of the data set, the average performance loss is about 11.3% for genre-specific concept detection and 31.5% for genre-specific topic search, while the processing speed increases hundreds of times for different video genres.

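The two-step idea (genre filtering first, concept detection only on the surviving subset) is easy to sketch; the classifiers below are stand-in callables, not the paper's trained models.

```python
# Sketch of two-step genre-specific concept detection: a cheap genre
# classifier prunes the collection, and the more expensive concept
# detector runs only on the remaining subset. Both models are stand-ins.

def two_step_detect(videos, genre_of, concept_score, target_genre):
    """videos: iterable of video ids; genre_of / concept_score: callables
    standing in for the trained genre and concept models."""
    subset = [v for v in videos if genre_of(v) == target_genre]  # step 1
    return {v: concept_score(v) for v in subset}                 # step 2
```

The efficiency gain comes from never invoking `concept_score` on videos outside the target genre.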
  • Reducing DRAM Image Data Access Energy Consumption in Video Processing

    Page(s): 303 - 313

This paper presents domain-specific techniques to reduce DRAM energy consumption for image data access in video processing. In mobile devices, video processing is one of the most energy-hungry tasks, and DRAM image data access has become an increasingly dominant share of overall video processing system energy consumption. Hence, it is highly desirable to develop domain-specific techniques that exploit unique image data access characteristics to improve DRAM energy efficiency. Nevertheless, prior efforts on reducing DRAM energy consumption in video processing pale in comparison with those on reducing video processing logic energy consumption. In this work, we first apply three simple yet effective data manipulation techniques that exploit the spatial/temporal correlation of image data to reduce DRAM image data access energy consumption, and then propose a heterogeneous DRAM architecture that better adapts to the unbalanced image access patterns of most video processing to further improve DRAM energy efficiency. DRAM modeling and power estimation have been carried out to evaluate these domain-specific design techniques, and the results show that they can reduce DRAM energy consumption by up to 92%.

  • Bridging the Semantic Gap via Functional Brain Imaging

    Page(s): 314 - 325

The multimedia content analysis community has made significant efforts to bridge the gaps between low-level features and the high-level semantics perceived by humans. Recent advances in brain imaging and neuroscience in exploring the human brain's responses during multimedia comprehension have demonstrated the possibility of leveraging cognitive neuroscience knowledge to bridge the semantic gap. This paper presents our initial effort in this direction using functional magnetic resonance imaging (fMRI). Specifically, task-based fMRI (T-fMRI) was performed to accurately localize the brain regions involved in video comprehension. Then, natural stimulus fMRI (N-fMRI) data were acquired while subjects watched multimedia clips selected from the TRECVID datasets. The responses in the localized brain regions were measured and used to extract high-level features as the representation of the brain's comprehension of semantics in the videos. A novel computational framework was developed to learn the most relevant low-level feature sets that best correlate with the fMRI-derived semantic features based on the training videos with fMRI scans; the learned model was then applied to larger scale TRECVID video datasets without fMRI scans for category classification. Our experimental results demonstrate that: 1) there are meaningful couplings between the brain's fMRI-derived responses and video stimuli, suggesting the validity of linking semantics and low-level features via fMRI; and 2) the computationally learned low-level features can significantly (p < 0.01) improve video classification in comparison with the original low-level features and with low-level features extracted by well-known feature projection algorithms.

  • Assessment of Stereoscopic Crosstalk Perception

    Page(s): 326 - 337

Stereoscopic three-dimensional (3-D) services do not always prevail when compared with their two-dimensional (2-D) counterparts, though the former can provide a more immersive experience with the help of binocular depth. Various specific 3-D artefacts might cause discomfort and severely degrade the Quality of Experience (QoE). In this paper, we analyze one of the most annoying artefacts in the visualization stage of stereoscopic imaging, namely crosstalk, by conducting extensive subjective quality tests. A statistical analysis of the subjective scores reveals that both scene content and camera baseline have significant impacts on crosstalk perception, in addition to the crosstalk level itself. Based on the observed visual variations during changes in the significant factors, three perceptual attributes of crosstalk are identified as the sensorial results of the human visual system (HVS): shadow degree, separation distance, and spatial position of crosstalk. They are classified into two categories, 2-D and 3-D perceptual attributes, which can be described by a Structural SIMilarity (SSIM) map and a filtered depth map, respectively. An objective quality metric for predicting crosstalk perception is then proposed by combining the two maps. The experimental results demonstrate that the proposed metric has a high correlation (over 88%) with subjective quality scores in a wide variety of situations.

  • Automatic Role Recognition in Multiparty Conversations: An Approach Based on Turn Organization, Prosody, and Conditional Random Fields

    Page(s): 338 - 345

Roles are a key aspect of social interactions, as they contribute to the overall predictability of social behavior (a necessary requirement to deal effectively with the people around us), and they result in stable, possibly machine-detectable behavioral patterns (a key condition for the application of machine intelligence technologies). This paper proposes an approach for the automatic recognition of roles in conversational broadcast data, in particular, news and talk shows. The approach makes use of behavioral evidence extracted from speaker turns and applies conditional random fields to infer the roles played by different individuals. The experiments are performed over a large amount of broadcast material (around 50 h), and the results show an accuracy higher than 85%.

  • Path Modeling and Retrieval in Distributed Video Surveillance Databases

    Page(s): 346 - 360

We propose a framework for querying a distributed database of video surveillance data in order to retrieve a set of likely paths of a person moving in the area under surveillance. In our framework, each camera of the surveillance system locally processes the data and stores video sequences in a storage unit and the metadata for each detected person in the distributed database. A pedestrian's path is formulated as a dynamic Bayesian network (DBN) to model the dependencies between subsequent observations of the person as he makes his way through the camera network. We propose a tool by which the analyst can pose queries about where a certain person appeared while moving in the site during a specified temporal window. The DBN is used in an algorithm that finds potentially relevant metadata records from the distributed databases and then assembles these into probable paths that the person took in the camera network. Finally, the system presents the analyst with the retrieved set of likely paths in ranked order. The computational complexity of our method is quadratic in the number of camera nodes and linear in the number of moving persons. Experiments were carried out on simulated data to test the system with large distributed databases and in a real setting in which six databases store the data from six video cameras. The simulations confirm that our method provides good results with varying numbers of cameras and persons moving in the network. In the real setting, the method reconstructs paths across the camera network with approximately 75% accuracy at rank 1.

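Ranking candidate cross-camera paths by the plausibility of their inter-camera travel times can be sketched as follows; the Gaussian-style transition likelihood is a hypothetical stand-in for the paper's DBN transition model, and the expected travel times are invented.

```python
# Toy sketch of cross-camera path ranking: a candidate path is a sequence
# of (camera, timestamp) observations, scored by the product of transition
# likelihoods. The likelihood here penalizes deviation from a hypothetical
# expected inter-camera travel time (not the paper's actual DBN).

import math

def transition_likelihood(dt, expected, sigma=5.0):
    return math.exp(-((dt - expected) ** 2) / (2 * sigma ** 2))

def path_score(path, expected_travel):
    """path: [(camera, timestamp), ...]; expected_travel: {(camA, camB): seconds}."""
    score = 1.0
    for (ca, ta), (cb, tb) in zip(path, path[1:]):
        score *= transition_likelihood(tb - ta, expected_travel[(ca, cb)])
    return score

def rank_paths(paths, expected_travel):
    return sorted(paths, key=lambda p: path_score(p, expected_travel), reverse=True)
```

A path whose timing matches the expected travel time ranks above one with an implausible gap.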
  • Weakly Supervised Graph Propagation Towards Collective Image Parsing

    Page(s): 361 - 373

In this work, we propose a weakly supervised graph propagation method to automatically assign labels annotated at image level to their contextually derived semantic regions. The graph is constructed with the over-segmented patches of the image collection as nodes. Image-level labels are imposed on the graph as weak supervision information over subgraphs, each of which corresponds to all patches of one image, and the contextual information across different images at patch level is then mined to assist the process of label propagation from images to their descendent regions. The ultimate optimization problem is efficiently solved by Convex Concave Programming (CCCP). Extensive experiments on four benchmark datasets clearly demonstrate the effectiveness of our proposed method for the task of collective image parsing. Two extensions, image annotation and concept-map-based image retrieval, demonstrate that the proposed image parsing algorithm can effectively aid other vision tasks.

  • Investigating the Effects of Multiple Factors Towards More Accurate 3-D Object Retrieval

    Page(s): 374 - 388

This paper proposes a novel framework for 3-D object retrieval, taking into account most of the factors that may affect retrieval performance. Initially, a novel 3-D model alignment method is introduced, which achieves accurate rotation estimation through the combination of two intuitive criteria, plane reflection symmetry and rectilinearity. After the pose normalization stage, a low-level descriptor extraction procedure follows, using three different types of descriptors that have been proven effective. Then, a novel procedure for combining the above descriptors takes place, which achieves higher retrieval performance than each descriptor does separately. The paper also provides an in-depth study of the factors that can further improve 3-D object retrieval accuracy, including selection of the appropriate dissimilarity metric, feature selection/dimensionality reduction on the initial low-level descriptors, and manifold learning for re-ranking of the search results. Experiments performed on two 3-D model benchmark datasets confirm our assumption that future research in 3-D object retrieval should focus more on the efficient combination of low-level descriptors and on the selection of the best features and matching metrics than on the investigation of the optimal 3-D object descriptor.

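One simple way to combine several low-level descriptors, in the spirit of the combination step this abstract emphasizes, is to min-max normalize each descriptor's distances to the query and rank by a weighted sum. The equal weights in the test below are invented; the paper's actual combination may differ.

```python
# Sketch of late fusion for 3-D object retrieval: normalize each
# descriptor's distance set to [0, 1], then rank models by a weighted sum
# of normalized distances (smallest combined distance first). Weights are
# hypothetical, not the paper's learned values.

def normalize(dists):
    lo, hi = min(dists.values()), max(dists.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in dists.items()}

def fuse_and_rank(distance_sets, weights):
    """distance_sets: list of {model_id: distance to query}; weights: same length."""
    fused = {}
    for dists, w in zip(distance_sets, weights):
        for k, v in normalize(dists).items():
            fused[k] = fused.get(k, 0.0) + w * v
    return sorted(fused, key=fused.get)  # best match first
```

Normalization keeps a descriptor with large raw distance values from dominating the sum.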
  • Probabilistic Motion Diffusion of Labeling Priors for Coherent Video Segmentation

    Page(s): 389 - 400

We present a robust algorithm for temporally coherent video segmentation. Our approach is driven by multi-label graph cut applied to successive frames, fusing information from the current frame with an appearance model and labeling priors propagated forward from past frames. We propagate using a novel motion diffusion model, producing a per-pixel motion distribution that mitigates the cumulative estimation errors inherent in systems adopting “hard” decisions on pixel motion at each frame. Further, we encourage spatial coherence by imposing label consistency constraints within image regions (super-pixels) obtained via a bank of unsupervised frame segmentations, such as mean-shift. We demonstrate quantitative improvements in accuracy over state-of-the-art methods on a variety of sequences exhibiting clutter and agile motion, adopting the Berkeley methodology for our comparative evaluation.

  • Analytical Modeling for Delay-Sensitive Video Over WLAN

    Page(s): 401 - 414

Delay-sensitive video transmission over IEEE 802.11 wireless local area networks (WLANs) is analyzed in a cross-layer optimization framework. The effect of the delay constraint on the quality of received packets is studied by analyzing the “expired-time packet discard rate”. Three analytical models are examined, and it is shown that the M/M/1 model is quite adequate for analyzing delay-limited applications such as live video transmission over WLAN. The optimal MAC retry limit corresponding to the minimum “total packet loss rate” is derived using both mathematical analysis and NS-2 simulations. We show that there is an interaction between the “packet overflow drop” and “expired-time packet discard” processes in the queue. Subsequently, by introducing the concept of virtual buffer size, we obtain the optimal buffer size needed to avoid “packet overflow drop”. We finally introduce a simple yet effective real-time algorithm for retry-limit adaptation over the IEEE 802.11 MAC that maintains loss protection for delay-critical video traffic, and show that the average link-layer throughput can be improved by our adaptive scheme.

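For a stable M/M/1 queue, the sojourn time is exponentially distributed with rate mu - lam, so the probability that a packet's delay exceeds a deadline d is exp(-(mu - lam) * d). This standard result gives a quick back-of-the-envelope estimate of the expired-time discard rate the abstract analyzes; the numbers in the comment are an invented example.

```python
# Expired-time discard probability under the M/M/1 model: the sojourn
# (queueing + service) time of a stable M/M/1 queue is exponential with
# rate mu - lam, so P(delay > d) = exp(-(mu - lam) * d).

import math

def expired_discard_prob(lam, mu, deadline):
    """lam: arrival rate (pkt/s), mu: service rate (pkt/s), deadline: seconds."""
    if lam >= mu:
        raise ValueError("queue is unstable (lam >= mu)")
    return math.exp(-(mu - lam) * deadline)

# e.g., 80 pkt/s arrivals, 100 pkt/s service, 100 ms deadline:
# P(delay > 0.1 s) = exp(-2), roughly 0.135
```

Raising the service rate (or shortening the queue via the virtual buffer idea) shrinks this probability exponentially.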
  • Optimizing Selective ARQ for H.264 Live Streaming: A Novel Method for Predicting Loss-Impact in Real Time

    Page(s): 415 - 430

This work proposes a quality-oriented, real-time capable prioritization technique for media units of H.264/AVC video streams. The derivation of estimates is based on analysis of the macroblock partitioning, the spatial extents of temporal dependencies, and the length and strength of prediction chains existing among macroblocks, thus incorporating the expected impact of error propagation. It is demonstrated how the prioritization scheme can be beneficially integrated into live streaming systems characterized by tight timing constraints, with a focus on content-aware selective automatic repeat request mechanisms. Additionally, it is shown how potentially limited feedback can be used to adapt the estimation process and improve prediction precision. The approach is compared against existing techniques in terms of practicability and efficiency, and tested under independent and bursty loss conditions in a wired and a wireless test setup. Moreover, the performance is examined when low-latency and constant-bitrate video settings are enforced using x264's novel encoding feature periodic-intra-refresh. Results of both experiments and simulations indicate that the proposed technique outperforms all reference techniques in nearly all test cases, and that the video quality can be further improved by incorporating receiver feedback.

  • QoE Prediction Model and its Application in Video Quality Adaptation Over UMTS Networks

    Page(s): 431 - 442

The primary aim of this paper is to present a new content-based, non-intrusive quality of experience (QoE) prediction model for low bitrate and resolution (QCIF) H.264 encoded videos and to illustrate its application in video quality adaptation over Universal Mobile Telecommunication System (UMTS) networks. The success of video applications over UMTS networks very much depends on meeting the QoE requirements of users. Thus, it is highly desirable to be able to predict and, if appropriate, to control video quality to meet such QoE requirements. Video quality is affected by distortions caused both by the encoder and by the UMTS access network. The impact of these distortions is content dependent, but this feature is not widely used in non-intrusive video quality prediction models. In the new model, we chose four key parameters that can impact video quality and hence the QoE: content type, sender bitrate, block error rate, and mean burst length. The video quality was predicted in terms of the mean opinion score (MOS). Subjective quality tests were carried out to develop and evaluate the model. The performance of the model was evaluated on an unseen dataset with good prediction accuracy (~93%). The model also performed well with the LIVE database, which was recently made available to the research community. We illustrate the application of the new model in a novel QoE-driven adaptation scheme at the pre-encoding stage in a UMTS network. Simulation results in NS2 demonstrate the effectiveness of the proposed adaptation scheme, especially at the UMTS access network, which is a bottleneck. An advantage of the model is that it is lightweight (so it can be implemented for real-time monitoring) and provides a measure of user-perceived quality without requiring time-consuming subjective tests. The model has potential applications in several other areas, including QoE control and optimization in network planning and content provisioning for network/service providers.

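A content-aware MOS predictor driven by the four listed parameters might look like the following purely illustrative stand-in. The content-type categories and every coefficient below are invented for the sketch; the paper fits its actual model to subjective test data.

```python
# Purely illustrative stand-in for a content-aware MOS predictor using the
# four parameters the abstract lists: content type, sender bitrate, block
# error rate (BLER), and mean burst length. All coefficients are invented.

def predict_mos(content_type, bitrate_kbps, bler, mean_burst_len):
    base = {"slight_movement": 4.2, "gentle_walking": 3.8, "rapid_movement": 3.2}
    mos = base.get(content_type, 3.5)
    mos += 0.5 * min(bitrate_kbps / 256.0, 1.0)  # higher bitrate helps, saturating
    mos -= 8.0 * bler                            # block errors hurt
    mos -= 0.1 * mean_burst_len                  # bursty losses hurt more
    return max(1.0, min(5.0, mos))               # clamp to the 1..5 MOS scale
```

A QoE-driven adaptation loop would invert such a model: pick the lowest sender bitrate whose predicted MOS still clears the target.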
  • Joint Source-Channel Coding and Optimization for Layered Video Broadcasting to Heterogeneous Devices

    Page(s): 443 - 455

Heterogeneous quality-of-service (QoS) video broadcast over wireless networks is a challenging problem, where the demand for better video quality must be reconciled with different display sizes and variable channel conditions. In this paper, we present a framework for broadcasting scalable video to heterogeneous QoS mobile users with diverse display devices and different channel conditions. The framework includes joint video source-channel coding and optimization. First, we model the problem of broadcasting a layered video to heterogeneous devices as an aggregate utility maximization problem. Second, based on scalable video coding, we introduce a temporal-spatial content distortion metric to build an adaptive layer structure, so as to serve mobile users with heterogeneous QoS requirements. Third, joint Fountain coding protection is introduced to provide a flexible and reliable video stream. Finally, we use a dynamic programming approach to obtain the optimal layer broadcasting policy, so as to achieve maximum broadcasting utility. The objective is to achieve maximum overall receiving quality for the heterogeneous QoS receivers. Experimental results demonstrate the effectiveness of the solution.

  • Web Video Geolocation by Geotagged Social Resources

    Page(s): 456 - 470

This paper considers the problem of web video geolocation: determining where on the Earth a web video was taken. By analyzing a 6.5-million geotagged web video dataset, we observe that there exist inherent geographic affinities between a video and its relevant videos (related videos and same-author videos). This social relationship supplies a direct and effective cue for locating the video in a particular region on the earth. Based on this observation, we propose an effective web video geolocation algorithm that propagates geotags over the web video social relationship graph. For videos that have no geotagged relevant videos, we collect geotagged relevant images that are content-similar to the video (sharing some visual or textual information with it) as the cue to infer the video's location. The experiments have demonstrated the effectiveness of both methods, with geolocation accuracy much better than that of state-of-the-art approaches. Finally, an online web video geolocation system, Video2Location (V2L), is developed to provide public access to our algorithm.

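Geotag propagation over a video social-relationship graph can be sketched as repeated neighborhood voting: an untagged video adopts the most common geotag among its already-tagged related or same-author videos. This is a hypothetical simplification of the propagation algorithm the abstract describes.

```python
# Sketch of geotag propagation on a video relationship graph: untagged
# videos repeatedly adopt the majority geotag of their tagged neighbors.
# A simplified stand-in for the paper's propagation algorithm.

from collections import Counter

def propagate_geotags(neighbors, tags, iterations=3):
    """neighbors: {video: [relevant videos]}; tags: {video: region or None}."""
    tags = dict(tags)  # do not mutate the caller's dict
    for _ in range(iterations):
        for v, nbrs in neighbors.items():
            if tags.get(v) is None:
                known = [tags[n] for n in nbrs if tags.get(n) is not None]
                if known:
                    tags[v] = Counter(known).most_common(1)[0][0]
    return tags
```

Running a few iterations lets tags reach videos that are two or more hops away from any geotagged video.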
  • Hierarchical Co-Clustering: A New Way to Organize the Music Data

    Page(s): 471 - 481

An important research topic in music information retrieval (MIR), which has attracted much attention recently, is the utilization of user-assigned tags, artist-related style, and mood labels, which can be extracted from music listening web sites such as Last.fm (http://www.last.fm/) and All Music Guide (http://www.allmusic.com/). A fundamental research problem in the area is how to understand the relationships among artists/songs and these different pieces of information. Co-clustering is the problem of simultaneously clustering two types of data (e.g., documents and words, or webpages and urls). We can naturally bring this idea to the situation at hand and consider clustering artists and tags together, artists and styles together, or artists and mood labels together. Once such co-clustering has been completed, one can identify co-existing clusters of artists and tags, styles, or mood labels; for simplicity, we use the acronym T/S/M to refer to tag(s), style(s), or mood(s) for the rest of the paper. When dealing with tags, it is worth noticing that some tags are more specific versions of others, which naturally suggests that the tags could be organized in hierarchical clusters. Such hierarchical organizations already exist for styles and mood labels, so we consider hierarchical co-clustering of artists and T/S/M. In this paper, we systematically study the application of hierarchical co-clustering (HCC) methods for organizing music data. There are two standard strategies for hierarchical clustering: the divisive strategy, which recursively divides the input data set into smaller groups, and the agglomerative strategy, which starts from individually separated data points and combines the most closely related pair into a larger group at each iteration. We compare these two strategies against each other, applying a previously known divisive hierarchical co-clustering method and a novel agglomerative hierarchical co-clustering method. In addition, we demonstrate that both methods can incorporate instance-level constraints to achieve better performance. Our experiments show that these two hierarchical co-clustering methods can be effectively deployed for organizing music data and deliver reasonable clustering performance compared with other clustering methods. A case study is also conducted to show that HCC provides a new way to quantify artist similarity.

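The agglomerative strategy described in the abstract can be sketched as repeatedly merging the closest pair of clusters until a target number remains. Single-link distance over scalar feature values keeps this illustration tiny, whereas the paper co-clusters artists and T/S/M from joint co-occurrence data.

```python
# Sketch of the agglomerative strategy: every item starts in its own
# cluster; the closest pair (single-link distance over 1-D features, a
# deliberate simplification) is merged until k clusters remain.

def single_link(c1, c2, x):
    return min(abs(x[i] - x[j]) for i in c1 for j in c2)

def agglomerative(x, k):
    """x: {item: feature value}; merge until k clusters remain."""
    clusters = [{i} for i in x]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]], x),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

The divisive strategy would run in the opposite direction, recursively splitting one all-inclusive cluster.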
  • Robustly Extracting Captions in Videos Based on Stroke-Like Edges and Spatio-Temporal Analysis

    Page(s): 482 - 489

This paper presents an effective and efficient approach to extracting captions from videos. The robustness of our system comes from two contributions. First, we propose a novel stroke-like edge detection method based on contours, which can effectively remove the interference of non-stroke edges in complex backgrounds, making the detection and localization of captions much more accurate. Second, our approach highlights the importance of the temporal (i.e., inter-frame) feature in the task of caption extraction (detection, localization, segmentation). Instead of regarding each video frame as an independent image, we fully utilize the temporal feature of video together with spatial analysis in caption localization, segmentation, and post-processing, and demonstrate that the use of inter-frame information can effectively improve the accuracy of caption localization and caption segmentation. In our comprehensive evaluation, experimental results on two representative datasets have shown the robustness and efficiency of our approach.

  • Correction to “Bayesian Visual Reranking” [Aug 2011 639-652]

    Page(s): 490

In the above titled paper (ibid., vol. 13, no. 4, pp. 639-652, Aug. 2011), the first author's name appears incorrectly in the byline as "Xinmie Tian" instead of "Xinmei Tian." The name appears correctly in the biography section.

  • IEEE Transactions on Multimedia EDICS

    Page(s): 491
  • IEEE Transactions on Multimedia information for authors

    Page(s): 492 - 493

Aims & Scope

The scope of this periodical covers the various aspects of research in multimedia technology and its applications.


Meet Our Editors

Editor-in-Chief
Chang Wen Chen
State University of New York at Buffalo