Topic-Based Hierarchical Segmentation

Authors:
Jen-Tzung Chien; Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC; Chuang-Hua Chueh

Latent Dirichlet allocation (LDA) is a new topic-modeling paradigm that is powerful at capturing latent topic information from natural language. However, the topic information in text streams, e.g., meeting recordings, lecture transcriptions, and conversational dialogues, is inherently heterogeneous and nonstationary, with no explicit boundaries, which makes it difficult to train a precise topic model from the observed text streams. Furthermore, word usage varies across paragraphs within a document because composition styles differ. In this paper, we present a new hierarchical segmentation model (HSM) that characterizes both the heterogeneous topic information at the stream level and the word variations at the document level. We incorporate contextual topic information into stream-level segmentation. The topic similarity between sentences is used to form a beta distribution reflecting the prior knowledge of document boundaries in a text stream. The distribution of the segmentation variable is adaptively updated to achieve flexible segmentation and is used to group coherent sentences into a topic-specific document. For each such pseudo-document, we further use a Markov chain to detect the stylistic segments within the document. The words in a segment are accordingly generated by the same composition style, which differs from the style of the next segment. Each segment is represented by a Markov state, so the word variations within a document are compensated. The whole model is trained by a variational Bayesian EM procedure and is evaluated on the TDT2 corpus. Experimental results show the benefits of the proposed HSM in terms of perplexity, segmentation error, detection accuracy, and F-measure.
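The stream-level idea, in which topic similarity between adjacent sentences shapes a beta prior over document boundaries, can be illustrated with a rough sketch. The abstract does not give the actual parameterization, so everything below is an assumption made for illustration: the cosine similarity measure, the mapping from similarity to beta parameters (`strength`, `eps`), the fixed `threshold`, and all function names are hypothetical. The sketch also skips the adaptive variational Bayesian update and simply cuts the stream wherever the prior mean of a boundary is high.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0

def boundary_beta_prior(similarity, strength=10.0, eps=1e-3):
    """Map a similarity in [0, 1] to Beta(a, b) parameters.

    Assumed mapping (not from the paper): low similarity puts more
    prior mass on "there is a boundary here".
    """
    s = min(max(similarity, 0.0), 1.0)  # guard against rounding error
    a = strength * (1.0 - s) + eps      # pseudo-count for "boundary"
    b = strength * s + eps              # pseudo-count for "no boundary"
    return a, b

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def segment_stream(topic_vectors, threshold=0.5):
    """Group sentence indices into pseudo-documents.

    A new pseudo-document starts wherever the prior mean of the
    boundary variable between adjacent sentences exceeds `threshold`.
    """
    if not topic_vectors:
        return []
    segments = [[0]]
    for i in range(1, len(topic_vectors)):
        sim = cosine_similarity(topic_vectors[i - 1], topic_vectors[i])
        a, b = boundary_beta_prior(sim)
        if beta_mean(a, b) > threshold:
            segments.append([i])    # start a new pseudo-document
        else:
            segments[-1].append(i)  # extend the current one
    return segments

if __name__ == "__main__":
    # Hypothetical per-sentence topic proportions over two topics.
    topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    print(segment_stream(topics))
```

In the model described above, the boundary distribution is updated adaptively during training rather than thresholded once, so this sketch shows only how a similarity-driven beta prior might be set up.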

Published in:

IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 1)