<?xml version="1.0" ?>
<rss version="2.0">
	<channel>
		<title><![CDATA[ Multimedia, IEEE Transactions on - new TOC ]]></title>
		<link>http://ieeexplore.ieee.org</link>
		<description>TOC Alert for Publication# 6046 </description>
		<year>2012</year>
		<month>February</month>
		<day>10</day>
		<item>
			<title><![CDATA[Table of Contents]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130621]]></link>
			<description><![CDATA[ ]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130621]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>C1</startPage>
			<endPage>C4</endPage>
			<fileSize>49</fileSize>
			<authors><![CDATA[]]></authors>
		</item>
		<item>
			<title><![CDATA[IEEE Transactions on Multimedia publication information]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130625]]></link>
			<description><![CDATA[ ]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130625]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>C2</startPage>
			<endPage>C2</endPage>
			<fileSize>35</fileSize>
			<authors><![CDATA[]]></authors>
		</item>
		<item>
			<title><![CDATA[Special Section on Object and Event Classification in Large-Scale Video Collections]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130576]]></link>
			<description><![CDATA[The nine papers in this special section on object and event classification in large-scale video collections can be categorized into four themes: video indexing, concept detection, video summarization, and event recognition.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130576]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>1</startPage>
			<endPage>2</endPage>
			<fileSize>34</fileSize>
			<authors><![CDATA[Xu, C.;Hanjalic, A.;Yan, S.;Liu, Q.;Smeaton, A. F.;]]></authors>
		</item>
		<item>
			<title><![CDATA[Multimodal Video Indexing and Retrieval Using Directed Information]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6009223]]></link>
			<description><![CDATA[We propose a novel framework for multimodal video indexing and retrieval using shrinkage optimized directed information assessment (SODA) as a similarity measure. The directed information (DI) is a variant of the classical mutual information which attempts to capture the direction of information flow that videos naturally possess. It is applied directly to the empirical probability distributions of both audio and visual features over successive frames. We utilize RASTA-PLP features for audio feature representation and SIFT features for visual feature representation. We compute the joint probability density functions of audio and visual features in order to fuse features from different modalities. With SODA, we further estimate the DI in a manner that is suitable for high dimensional features <i>p</i> and small sample size <i>n</i> (large <i>p</i> small <i>n</i>) between pairs of video-audio modalities. We demonstrate the superiority of the SODA approach in video indexing, retrieval, and activity recognition as compared to the state-of-the-art methods such as hidden Markov models (HMM), support vector machine (SVM), cross-media indexing space (CMIS), and other noncausal divergence measures such as mutual information (MI). We also demonstrate the success of SODA in audio and video localization and indexing/retrieval of data with misaligned modalities.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6009223]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>3</startPage>
			<endPage>16</endPage>
			<fileSize>1116</fileSize>
			<authors><![CDATA[Xu Chen;Hero, A.O.;Savarese, S.;]]></authors>
		</item>
		<item>
			<title><![CDATA[Interactive Video Indexing With Statistical Active Learning]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6069865]]></link>
			<description><![CDATA[Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design, our approach simultaneously exploits each sample's local structure and the sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least square model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density and diversity information, and sample efficacy in minimizing the parameter variance of the proposed local learning regularized least square model. We compare the performance between our approach and the state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark. We report superior performance from the proposed approach.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6069865]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>17</startPage>
			<endPage>27</endPage>
			<fileSize>1077</fileSize>
			<authors><![CDATA[Zheng-Jun Zha;Meng Wang;Yan-Tao Zheng;Yi Yang;Richang Hong;Tat-Seng Chua;]]></authors>
		</item>
		<item>
			<title><![CDATA[Large-Scale Vehicle Detection, Indexing, and Search in Urban Surveillance Videos]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035786]]></link>
			<description><![CDATA[We present a novel approach for visual detection and attribute-based search of vehicles in crowded surveillance scenes. Large-scale processing is addressed along two dimensions: 1) large-scale indexing, where hundreds of billions of events need to be archived per month to enable effective search and 2) learning vehicle detectors with large-scale feature selection, using a feature pool containing millions of feature descriptors. Our method for vehicle detection also explicitly models occlusions and multiple vehicle types (e.g., buses, trucks, SUVs, cars), while requiring very little manual labeling. It runs quite efficiently at an average of 66 Hz on a conventional laptop computer. Once a vehicle is detected and tracked over the video, fine-grained attributes are extracted and ingested into a database to allow future search queries such as &#x201C;Show me all blue trucks larger than 7 ft. length traveling at high speed northbound last Saturday, from 2 pm to 5 pm&#x201D;. We perform a comprehensive quantitative analysis to validate our approach, showing its usefulness in realistic urban surveillance settings.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035786]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>28</startPage>
			<endPage>42</endPage>
			<fileSize>2390</fileSize>
			<authors><![CDATA[Feris, R.S.;Siddiquie, B.;Petterson, J.;Yun Zhai;Datta, A.;Brown, L.M.;Pankanti, S.;]]></authors>
		</item>
		<item>
			<title><![CDATA[Sparse Ensemble Learning for Concept Detection]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6020805]]></link>
			<description><![CDATA[This work presents a novel sparse ensemble learning scheme for concept detection in videos. The proposed ensemble first exploits a sparse non-negative matrix factorization (NMF) process to represent data instances in parts and partition the data space into localities, and then coordinates the individual classifiers in each locality for final classification. In the sparse NMF, data exemplars are projected to a set of locality bases, in which the non-negative superposition of basis images reconstructs the original exemplars. This additive combination ensures that each locality captures the characteristics of data exemplars in part, thus enabling the local classifiers to hold reasonable diversity in their own regions of expertise. More importantly, the sparse NMF ensures that an exemplar is projected to only a few bases (localities) with non-zero coefficients. The resultant ensemble model is, therefore, sparse, in the way that only a small number of efficient classifiers in the ensemble will fire on a testing sample. Extensive tests on the TRECVid 08 and 09 datasets show that the proposed ensemble learning achieves promising results and outperforms existing approaches. The proposed scheme is feature-independent, and can be applied in many other large scale pattern recognition problems besides visual concept detection.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6020805]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>43</startPage>
			<endPage>54</endPage>
			<fileSize>1642</fileSize>
			<authors><![CDATA[Sheng Tang;Yan-Tao Zheng;Yu Wang;Tat-Seng Chua;]]></authors>
		</item>
		<item>
			<title><![CDATA[Parallel Lasso for Large-Scale Video Concept Detection]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6069863]]></link>
			<description><![CDATA[Existing video concept detectors are generally built upon kernel-based machine learning techniques, e.g., support vector machines, regularized least squares, and logistic regression. However, in order to build robust detectors, the learning process suffers from scalability issues, including the high-dimensional multi-modality visual features and the large-scale keyframe examples. In this paper, we propose parallel lasso (Plasso) by introducing the parallel distributed computation to significantly improve the scalability of lasso (the <i>l</i><sub>1</sub> regularized least squares). We apply the parallel incomplete Cholesky factorization to approximate the covariance statistics in the preprocessing step, and the parallel primal-dual interior-point method with the Sherman-Morrison-Woodbury formula to optimize the model parameters. For a dataset with <i>n</i> samples in a <i>d</i>-dimensional space, compared with lasso, Plasso significantly reduces complexities from the original <i>O</i>(<i>d</i><sup>3</sup>) for computational time and <i>O</i>(<i>d</i><sup>2</sup>) for storage space to <i>O</i>(<i>h</i><sup>2</sup><i>d</i>/<i>m</i>) and <i>O</i>(<i>hd</i>/<i>m</i>), respectively, if the system has <i>m</i> processors and the reduced dimension <i>h</i> is much smaller than the original dimension <i>d</i>. Furthermore, we develop the kernel extension of the proposed linear algorithm with the sample reweighting schema, and we can achieve similar time and space complexity improvements [time complexity from <i>O</i>(<i>n</i><sup>3</sup>) to <i>O</i>(<i>h</i><sup>2</sup><i>n</i>/<i>m</i>) and the space complexity from <i>O</i>(<i>n</i><sup>2</sup>) to <i>O</i>(<i>hn</i>/<i>m</i>), for a dataset with <i>n</i> training examples]. Experimental results on TRECVID video concept detection challenges suggest that the proposed method can obtain significant time and space savings for training effective detectors with limited communication overhead.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6069863]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>55</startPage>
			<endPage>65</endPage>
			<fileSize>1192</fileSize>
			<authors><![CDATA[Bo Geng;Yangxi Li;Dacheng Tao;Meng Wang;Zheng-Jun Zha;Chao Xu;]]></authors>
		</item>
		<item>
			<title><![CDATA[Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6008652]]></link>
			<description><![CDATA[The rapid growth of consumer videos requires an effective and efficient content summarization method to provide a user-friendly way to manage and browse the huge amount of video data. Compared with most previous methods that focus on sports and news videos, the summarization of personal videos is more challenging because of its unconstrained content and the lack of any pre-imposed video structures. We formulate video summarization as a novel dictionary selection problem using sparsity consistency, where a dictionary of key frames is selected such that the original video can be best reconstructed from this representative dictionary. An efficient global optimization algorithm is introduced to solve the dictionary selection model with a convergence rate of <i>O</i>(1/<i>K</i><sup>2</sup>) (where <i>K</i> is the iteration counter), in contrast to the <i>O</i>(1/&#x221A;<i>K</i>) rate of traditional sub-gradient descent methods. Our method provides a scalable solution for both key frame extraction and video skim generation, because one can select an arbitrary number of key frames to represent the original videos. Experiments on a human labeled benchmark dataset and comparisons to the state-of-the-art methods demonstrate the advantages of our algorithm.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6008652]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>66</startPage>
			<endPage>75</endPage>
			<fileSize>2141</fileSize>
			<authors><![CDATA[Yang Cong;Junsong Yuan;Jiebo Luo;]]></authors>
		</item>
		<item>
			<title><![CDATA[Summarizing Rushes Videos by Motion, Object, and Event Understanding]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=5993544]]></link>
			<description><![CDATA[Rushes footage is considered a cheap gold mine with the potential for reuse in the broadcasting and filmmaking industries. However, mining &#x201C;gold&#x201D; from unedited videos such as rushes is challenging as the reusable segments are buried in a large set of redundant information. In this paper, we propose a unified framework for stock footage classification and summarization to support video editors in navigating and organizing rushes videos. Our approach is composed of two steps. First, we employ motion features to filter the undesired camera motion and locate the stock footage. A hierarchical hidden Markov model (HHMM) is proposed to model the motion feature distribution and classify video segments into different categories to decide their potential for reuse. Second, we generate a short video summary to facilitate quick browsing of the stock footage by including the objects and events that are important for storytelling. For objects, we detect the presence of persons and moving objects. For events, we extract a set of features to detect and describe visual (motion activities and scene changes) and audio events (speech clips). A representability measure is then proposed to select the most representative video clips for video summarization. Our experiments show that the proposed HHMM significantly outperforms other methods based on SVM, FSM, and HMM. The automatically generated rushes summaries are also demonstrated to be easy to understand, containing little redundancy, and capable of including ground-truth objects and events with shorter durations and relatively pleasant rhythm based on the TRECVID 2007, 2008, and our subjective evaluations.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=5993544]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>76</startPage>
			<endPage>87</endPage>
			<fileSize>981</fileSize>
			<authors><![CDATA[Feng Wang;Chong-Wah Ngo;]]></authors>
		</item>
		<item>
			<title><![CDATA[Semantic Model Vectors for Complex Video Event Recognition]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6024471]]></link>
			<description><![CDATA[We propose semantic model vectors, an intermediate level semantic representation, as a basis for modeling and detecting complex events in unconstrained real-world videos, such as those from YouTube. The semantic model vectors are extracted using a set of discriminative semantic classifiers, each being an ensemble of SVM models trained from thousands of labeled web images, for a total of 280 generic concepts. Our study reveals that the proposed semantic model vectors representation outperforms, and is complementary to, other low-level visual descriptors for video event modeling. We hence present an end-to-end video event detection system, which combines semantic model vectors with other static or dynamic visual descriptors, extracted at the frame, segment, or full clip level. We perform a comprehensive empirical study on the 2010 TRECVID Multimedia Event Detection task (http://www.nist.gov/itl/iad/mig/med10.cfm), which validates the semantic model vectors representation not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact. We also study early and late feature fusion across the various approaches, leading to a 15% performance boost and an overall system performance of 0.46 mean average precision. In order to promote further research in this direction, we made our semantic model vectors for the TRECVID MED 2010 set publicly available for the community to use (http://www1.cs.columbia.edu/~mmerler/SMV.html).]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6024471]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>88</startPage>
			<endPage>101</endPage>
			<fileSize>2204</fileSize>
			<authors><![CDATA[Merler, M.;Huang, B.;Lexing Xie;Gang Hua;Natsev, A.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A Matrix-Based Approach to Unsupervised Human Action Categorization]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6082444]]></link>
			<description><![CDATA[Human action, as the basic unit of most human-relevant video content, bridges the gap between low-level visual features and high-level semantics. Human action recognition is of great significance in the applications of human-computer interaction, intelligent video surveillance, video retrieval and search. In this paper, we propose a novel unsupervised approach to mining categories from action video sequences, which consists of two modules: action representation for video data structurization and learning model for unsupervised categorization. In action representation, a novel view of video decomposition is presented. Videos are regarded as spatially distributed dynamic pixel time series, and these dynamic pixels are first quantized into pixel prototypes. After replacing the pixel time series with their corresponding prototype labels, the video sequences are compressed into two-dimensional action matrices. In the learning model, we put these matrices together to form a multi-action tensor, and propose the joint matrix factorization method to simultaneously cluster the pixel prototypes into pixel signatures, and matrices into action classes with the consideration of the duality between pixel clustering and action clustering. The approach is tested on the public and popular Weizmann and KTH datasets, and promising results are achieved.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6082444]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>102</startPage>
			<endPage>110</endPage>
			<fileSize>745</fileSize>
			<authors><![CDATA[Peng Cui;Fei Wang;Li-Feng Sun;Jian-Wei Zhang;Shi-Qiang Yang;]]></authors>
		</item>
		<item>
			<title><![CDATA[Efficient Video Coding Using Legacy Algorithmic Approaches]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025300]]></link>
			<description><![CDATA[We show that for high bit rates, a video coding algorithm using a suitable combination of the QM coder and other methods first published over 20 years ago can deliver video quality rivaling that of H.264 at lower complexity. This has implications both technically, since encoders built using these methods can be more power efficient, and commercially, given the complex licensing and intellectual property issues that accompany newer coding methods such as H.264 and MPEG-4. The methods described in this paper are the basis for the recent decision of the MPEG standards group to begin work on what is referred to as the &#x201C;Type-1 Video Coding&#x201D; standard, which, in addition to aiming for high coding efficiency, is intended to minimize royalty issues.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025300]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>111</startPage>
			<endPage>120</endPage>
			<fileSize>922</fileSize>
			<authors><![CDATA[Jianwen Chen;Feng Xu;Yun He;Villasenor, J.;Yuxing Han;Yan Xu;Yaocheng Rong;Reader, C.;Jiangtao Wen;]]></authors>
		</item>
		<item>
			<title><![CDATA[Depth Video Coding Using Adaptive Geometry Based Intra Prediction for 3-D Video Systems]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025301]]></link>
			<description><![CDATA[Depth video coding is an essential part of 3-D video processing systems. Specifically, object boundary regions are important in depth video coding since these regions significantly affect the visual quality of a synthesized view. In this paper, we propose an efficient depth video coding method to determine precise intra prediction modes and thereby reduce the loss of boundary information. To achieve this objective, we analyze and exploit statistical and geometric characteristics of the depth video. Experimental results subsequently show that the proposed method performs better than the original intra prediction of H.264/AVC in terms of bit savings and rendering quality.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025301]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>121</startPage>
			<endPage>128</endPage>
			<fileSize>1836</fileSize>
			<authors><![CDATA[Min-Koo Kang;Yo-Sung Ho;]]></authors>
		</item>
		<item>
			<title><![CDATA[Rhythm of Motion Extraction and Rhythm-Based Cross-Media Alignment for Dance Videos]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6051493]]></link>
			<description><![CDATA[We present how to extract rhythm information in dance videos and music, and accordingly correlate them based on rhythmic representation. From the dancer's movement, we construct motion trajectories, detect turnings and stops of trajectories, and then estimate rhythm of motion (ROM). For music, beats are detected to describe rhythm of music. Two modalities are therefore represented as sequences of rhythm information to facilitate finding cross-media correspondence. Two applications, i.e., background music replacement and music video generation, are developed to demonstrate the practicality of cross-media correspondence. We evaluate performance of ROM extraction, and conduct subjective/objective evaluation to show that rich browsing experience can be provided by the proposed applications.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6051493]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>129</startPage>
			<endPage>141</endPage>
			<fileSize>1255</fileSize>
			<authors><![CDATA[Wei-Ta Chu;Shang-Yin Tsai;]]></authors>
		</item>
		<item>
			<title><![CDATA[Error Weighted Semi-Coupled Hidden Markov Model for Audio-Visual Emotion Recognition]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6042338]]></link>
			<description><![CDATA[This paper presents an approach to the automatic recognition of human emotions from audio-visual bimodal signals using an error weighted semi-coupled hidden Markov model (EWSC-HMM). The proposed approach combines an SC-HMM with a state-based bimodal alignment strategy and a Bayesian classifier weighting scheme to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. The state-based bimodal alignment strategy in SC-HMM is proposed to align the temporal relation between audio and visual streams. The Bayesian classifier weighting scheme is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs in order to obtain the emotion recognition output. For performance evaluation, two databases are considered: the MHMC posed database and the SEMAINE naturalistic database. Experimental results show that the proposed approach not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides satisfactory results for naturalistic expressions.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6042338]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>142</startPage>
			<endPage>156</endPage>
			<fileSize>1537</fileSize>
			<authors><![CDATA[Jen-Chun Lin;Chung-Hsien Wu;Wen-Li Wei;]]></authors>
		</item>
		<item>
			<title><![CDATA[Asymmetric Coding of Multi-View Video Plus Depth Based 3-D Video for View Rendering]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025302]]></link>
			<description><![CDATA[Recent years have witnessed three-dimensional (3-D) video technology becoming increasingly popular, as it can provide a high-quality and immersive experience to end users, where view rendering with the depth-image-based rendering (DIBR) technique is employed to generate the virtual views. Distortions in the depth map may induce geometry changes in the virtual views, and distortions in texture video may be propagated to the virtual views. Thus, effective compression of both texture videos and depth maps is important for 3-D video systems. From the perspective of bit allocation, asymmetric coding of the texture videos and depth maps is an effective way to get the optimal solution of 3-D video compression and view rendering problems. In this paper, a novel asymmetric coding method of multi-view video plus depth (MVD) based 3-D video is proposed for the purpose of providing high-quality view rendering. In the proposed method, two models are proposed to characterize view rendering distortion and binocular suppression in 3-D video. Then, an asymmetric coding method of MVD-based 3-D video is proposed by combining the two models in the encoding framework. Finally, a chrominance reconstruction algorithm is presented to achieve accurate reconstruction. Experimental results show that compared with other methods, the proposed method can obtain higher performance of view rendering under the total bitrate constraint. Moreover, the perceptual visual quality of 3-D video is almost unaffected with the proposed method.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6025302]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>157</startPage>
			<endPage>167</endPage>
			<fileSize>1199</fileSize>
			<authors><![CDATA[Feng Shao;Gangyi Jiang;Mei Yu;Ken Chen;Yo-Sung Ho;]]></authors>
		</item>
		<item>
			<title><![CDATA[Nonrigid Structure-From-Motion From 2-D Images Using Markov Chain Monte Carlo]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6032104]]></link>
			<description><![CDATA[In this paper we present a new method for simultaneously determining 3-D shape and motion of a nonrigid object from uncalibrated 2-D images without assuming the distribution characteristics. A nonrigid motion can be treated as a combination of a rigid rotation and a nonrigid deformation. To seek accurate recovery of deformable structures, we estimate the probability distribution function of the corresponding features through random sampling, incorporating an established probabilistic model. The fitting between the observation and the projection of the estimated 3-D structure will be evaluated using a Markov chain Monte Carlo based expectation maximization algorithm. Applications of the proposed method to both synthetic and real image sequences are demonstrated with promising results.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6032104]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>168</startPage>
			<endPage>177</endPage>
			<fileSize>1101</fileSize>
			<authors><![CDATA[Huiyu Zhou;Xuelong Li;Sadka, A.H.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A Novel Large-Scale Digital Forensics Service Platform for Internet Videos]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6032752]]></link>
			<description><![CDATA[The increasing transmission of illegal videos over the Internet creates the need to develop large-scale digital video forensics systems for prosecuting and deterring digital crimes on the Internet. In this paper, we propose, design, and implement a novel large-scale Digital Forensics Service Platform (DFSP) that can effectively detect illegal content from Internet videos. More specifically, we propose a distributed architecture by taking advantage of Content Delivery Network (CDN) to improve scalability, which can process an enormous number of Internet videos in real time. We propose a CDN-based Resource-Aware Scheduling (CRAS) algorithm, which schedules the tasks efficiently in the DFSP according to resource parameters, such as delay and computation load. We deploy the DFSP system on the Internet, which integrates the CDN-based distributed architecture and CRAS algorithm with a large-scale video detection algorithm, and evaluate the deployed system. Our evaluation results demonstrate the effectiveness of the platform.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6032752]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>178</startPage>
			<endPage>186</endPage>
			<fileSize>911</fileSize>
			<authors><![CDATA[Hao Yin;Wen Hui;Hongzhi Li;Chuang Lin;Wenwu Zhu;]]></authors>
		</item>
		<item>
			<title><![CDATA[Bottom-Up Saliency Detection Model Based on Human Visual Sensitivity and Amplitude Spectrum]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6029456]]></link>
			<description><![CDATA[With the wide application of saliency information in visual signal processing, many saliency detection methods have been proposed. However, some key characteristics of the human visual system (HVS) are still neglected in building these saliency detection models. In this paper, we propose a new saliency detection model based on human visual sensitivity and the amplitude spectrum of the quaternion Fourier transform (QFT). We use the amplitude spectrum of the QFT to represent the color, intensity, and orientation distributions of image patches. The saliency value of each image patch is calculated not only from the differences between the QFT amplitude spectrum of this patch and those of the other patches in the whole image, but also from the visual impact of these differences as determined by human visual sensitivity. The experimental results show that the proposed saliency detection model outperforms the state-of-the-art detection models. In addition, we apply our proposed model to image retargeting and achieve better performance than the conventional algorithms.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6029456]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>187</startPage>
			<endPage>198</endPage>
			<fileSize>1579</fileSize>
			<authors><![CDATA[Yuming Fang;Weisi Lin;Bu-Sung Lee;Chiew-Tong Lau;Zhenzhong Chen;Chia-Wen Lin;]]></authors>
		</item>
		<item>
			<title><![CDATA[Hidden-Concept Driven Multilabel Image Annotation and Label Ranking]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035980]]></link>
			<description><![CDATA[Conventional semisupervised image annotation algorithms usually propagate labels predominantly via holistic similarities over image representations and do not fully consider the label locality, inter-label similarity, and intra-label diversity among multilabel images. Taking these problems into consideration, we present the hidden-concept driven image annotation and label ranking algorithm (HDIALR), which conducts label propagation based on the similarity over a visually and semantically consistent hidden-concept space. The proposed method has the following characteristics: 1) each holistic image representation is implicitly decomposed into label representations to reveal label locality: the decomposition is guided by the so-called hidden concepts, characterizing image regions and reconstructing both visual and nonvisual labels of the entire image; 2) each label is represented by a linear combination of hidden concepts, and similar linear coefficients reveal the inter-label similarity; 3) each hidden concept is expressed as a respective subspace, and different expressions of the same label over the subspace then induce the intra-label diversity; and 4) a sparse coding-based graph is proposed to enforce the collective consistency between image labels and image representations, such that it naturally avoids the dilemma of possible inconsistency between the pairwise label similarity and image representation similarity in the multilabel scenario. These properties are finally embedded in a regularized nonnegative data factorization formulation, which decomposes image representations into label representations over both labeled and unlabeled data for label propagation and ranking. The objective function is iteratively optimized by a provably convergent updating procedure. Extensive experiments on three benchmark image datasets validate the effectiveness of our proposed solution to the semisupervised multilabel image annotation and label ranking problem.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035980]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>199</startPage>
			<endPage>210</endPage>
			<fileSize>2388</fileSize>
			<authors><![CDATA[Bing-Kun Bao;Teng Li;Shuicheng Yan;]]></authors>
		</item>
		<item>
			<title><![CDATA[An Enhanced Bag-of-Visual Word Vector Space Model to Represent Visual Content in Athletics Images]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035787]]></link>
			<description><![CDATA[Images that have a different visual appearance may be semantically related using a higher level conceptualization. However, image classification and retrieval systems tend to rely only on the low-level visual structure within images. This paper presents a framework to deal with this semantic gap limitation by exploiting the well-known bag-of-visual words (BVW) to represent visual content. The novelty of this paper is threefold. First, the quality of visual words is improved by constructing visual words from representative keypoints. Second, domain specific &#x201C;non-informative visual words&#x201D; are detected which are useless to represent the content of visual data but which can degrade the categorization capability. Distinct from existing frameworks, two main characteristics for non-informative visual words are defined: a high document frequency (DF) and a small statistical association with all the concepts in the collection. The third contribution in this paper is that a novel method is used to restructure the vector space model of visual words with respect to a structural ontology model in order to resolve visual synonym and polysemy problems. The experimental results show that our method can disambiguate visual word senses effectively and can significantly improve classification, interpretation, and retrieval performance for the athletics images.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035787]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>211</startPage>
			<endPage>222</endPage>
			<fileSize>1165</fileSize>
			<authors><![CDATA[Kesorn, K.;Poslad, S.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A Model-Based Shot Boundary Detection Technique Using Frame Transition Parameters]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035981]]></link>
			<description><![CDATA[We present a unified model for detecting different types of video shot transitions. Based on the proposed model, we formulate a frame estimation scheme using the previous and the next frames. Unlike other shot boundary detection algorithms, which rely on properties of frames, our method uses frame transition parameters and frame estimation errors based on global and local features for boundary detection and classification. Local features include the scatter matrix of edge strength and the motion matrix. Finally, the frames are classified as no change (within shot frame), abrupt change, or gradual change frames using a multilayer perceptron network. The proposed method is relatively less dependent on user-defined thresholds and does not depend on a sliding window size, unlike various schemes found in the literature. Moreover, handling both abrupt and gradual transitions along with non-transition frames under a single framework using model-guided visual features is another unique aspect of the work.]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6035981]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>223</startPage>
			<endPage>233</endPage>
			<fileSize>910</fileSize>
			<authors><![CDATA[Mohanta, P.P.;Saha, S.K.;Chanda, B.;]]></authors>
		</item>
		<item>
			<title><![CDATA[IEEE Transactions on Multimedia Edics]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130623]]></link>
			<description><![CDATA[ ]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130623]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>234</startPage>
			<endPage>234</endPage>
			<fileSize>16</fileSize>
			<authors><![CDATA[]]></authors>
		</item>
		<item>
			<title><![CDATA[IEEE Transactions on Multimedia information for authors]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130622]]></link>
			<description><![CDATA[ ]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130622]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>235</startPage>
			<endPage>236</endPage>
			<fileSize>46</fileSize>
			<authors><![CDATA[]]></authors>
		</item>
		<item>
			<title><![CDATA[IEEE Transactions on Multimedia society information]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130624]]></link>
			<description><![CDATA[ ]]></description>
			<pubDate><![CDATA[Feb.  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6130620&arnumber=6130624]]></guid>
			<volume>14</volume>
			<issue>1</issue>
			<startPage>C3</startPage>
			<endPage>C3</endPage>
			<fileSize>28</fileSize>
			<authors><![CDATA[]]></authors>
		</item>
	</channel>
</rss>
