Given the enormous volume of text on the Internet, automatic topic discovery from large text corpora has become an important task for advanced intelligence information analysis, such as opinion recognition and Web user interest analysis. Although many topic mining methods have shown great success in topic-based analysis tasks, discovering meaningful topic descriptions for informatics analysis remains an open need. To avoid explaining a topic with words of different granularity, a mechanism is proposed for separating a text corpus into two subsets with equal semantic topics. The EM algorithm is employed to infer topic models for the subsets. A merging process is then devised to generate topic descriptions from the output of EM. Experiments on the standard AP text corpus show that the proposed topic discovery method achieves better perplexity, which indicates a better ability to predict topics. Furthermore, a topic extraction test on a collection of news documents about the recent Expo 2010 Shanghai China shows that the descriptive keywords in the topics are more meaningful and reasonable than those produced by traditional topic mining methods.
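As an illustration of the evaluation metric mentioned above, the following minimal sketch computes corpus perplexity in its commonly used form, the exponential of the negative average per-word log-likelihood on held-out documents. The abstract does not state the exact formula the authors used, so this definition is an assumption, and the function name `perplexity` and its parameters are hypothetical. Under this definition, lower values are better.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Corpus perplexity under the common definition (assumed, not taken from the paper).

    doc_log_likelihoods: natural-log likelihood of each held-out document
                         under the fitted topic model.
    doc_lengths:         number of word tokens in each document.
    """
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    if total_words <= 0:
        raise ValueError("corpus must contain at least one word token")
    # exp of the negative average per-word log-likelihood
    return math.exp(-total_log_likelihood / total_words)
```

For example, a model that assigns every word of a 10-word document a probability of 1/4 gives `perplexity([10 * math.log(0.25)], [10])`, which is 4 up to floating-point error: the model is as uncertain as a uniform choice among four words.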