Latent Dirichlet allocation (LDA) is a new paradigm of topic modeling that effectively captures latent topic information from natural language. However, the topic information in text streams, e.g. meeting recordings, lecture transcriptions and conversational dialogues, is inherently heterogeneous and nonstationary, without explicit boundaries. It is therefore difficult to train a precise topic model from the observed text streams. Furthermore, word usage varies across the paragraphs of a document, which are written in different composition styles. In this paper, we present a new hierarchical segmentation model (HSM) that characterizes both the heterogeneous topic information at the stream level and the word variations at the document level. We incorporate contextual topic information into stream-level segmentation. The topic similarity between sentences is used to form a beta distribution reflecting prior knowledge of document boundaries in a text stream. The distribution of the segmentation variable is adaptively updated to achieve flexible segmentation and is used to group coherent sentences into a topic-specific document. For each such pseudo-document, we further use a Markov chain to detect the stylistic segments within the document. The words in a segment are generated by the same composition style, which differs from the style of the next segment. Each segment is represented by a Markov state, so the word variations within a document are compensated. The whole model is trained by a variational Bayesian EM procedure and is evaluated on the TDT2 corpus. Experimental results show the benefits of the proposed HSM in terms of perplexity, segmentation error, detection accuracy and F-measure.
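
To make the stream-level idea concrete, the following is a minimal sketch of how topic similarity between adjacent sentences could shape a beta prior over document boundaries. It is not the HSM inference procedure: the abstract does not specify how similarity maps to beta parameters, so the mapping in `boundary_prior`, the `strength` and `threshold` parameters, and the use of cosine similarity over per-sentence topic vectors are all assumptions made for illustration. The variational Bayesian update of the segmentation variable is replaced here by a simple threshold on the prior mean.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boundary_prior(sim, strength=10.0):
    """Hypothetical mapping from similarity to Beta(a, b) parameters.

    Assumption: low similarity should favor a boundary, so the prior
    mean a / (a + b) is set close to 1 - sim. `strength` controls how
    concentrated the prior is. Neither choice comes from the paper.
    """
    sim = min(max(sim, 0.0), 1.0)
    eps = 1e-3  # keeps both parameters strictly positive
    a = strength * (1.0 - sim) + eps
    b = strength * sim + eps
    return a, b


def segment(topic_vectors, threshold=0.5, strength=10.0):
    """Group sentence indices into pseudo-documents.

    A boundary is placed before sentence i when the prior mean boundary
    probability exceeds `threshold`. This stands in for the adaptive
    posterior update described in the abstract.
    """
    if not topic_vectors:
        return []
    docs = [[0]]
    for i in range(1, len(topic_vectors)):
        sim = cosine(topic_vectors[i - 1], topic_vectors[i])
        a, b = boundary_prior(sim, strength)
        if a / (a + b) > threshold:
            docs.append([i])      # start a new pseudo-document
        else:
            docs[-1].append(i)    # extend the current pseudo-document
    return docs


# Four sentences over two topics: the topic shifts between sentences 1 and 2.
stream = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(segment(stream))
```

In this toy stream the first two sentences share a dominant topic, as do the last two, so the sketch groups them into two pseudo-documents. In the model described above, the topic vectors would themselves be latent and the boundary distribution would be updated jointly with them rather than fixed in advance.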