Statistical topic models such as Latent Dirichlet Allocation (LDA) have emerged as an attractive framework to model, visualize and summarize large document collections in a completely unsupervised fashion. Given the enormous size of modern electronic document collections, it is very important that these models be fast and scalable. In this work, we build parallel implementations of the variational EM algorithm for LDA in a multiprocessor architecture as well as in a distributed setting. Our experiments on document collections of various sizes indicate that while both implementations achieve speed-ups, the distributed version achieves dramatic improvements in both speed and scalability. We also analyze the costs associated with the various stages of the EM algorithm and suggest ways to further improve performance.
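The abstract does not spell out how the work is divided, so the following is only a minimal sketch of one common way to parallelize variational EM for LDA, not the authors' implementation. It assumes the documents are split into shards, each worker runs the per-document E-step on its shard and returns topic-word sufficient statistics, and a single M-step sums those statistics and renormalizes. All names (`digamma`, `e_step_shard`, `m_step`, `parallel_em`), the fixed `alpha`, and the `smoothing` constant are hypothetical choices made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step_shard(docs, beta, alpha, iters=20):
    """Variational E-step over one shard; returns K x V expected topic-word counts.

    Each document is a list of (word_id, count) pairs; beta[k][w] is p(word w | topic k).
    """
    K, V = len(beta), len(beta[0])
    stats = [[0.0] * V for _ in range(K)]
    for doc in docs:
        total = sum(c for _, c in doc)
        gamma = [alpha + total / K] * K
        phis = []
        for _ in range(iters):
            weights = [math.exp(digamma(g)) for g in gamma]
            new_gamma = [alpha] * K
            phis = []
            for w, c in doc:
                phi = [beta[k][w] * weights[k] for k in range(K)]
                norm = sum(phi)
                phi = [p / norm for p in phi]
                phis.append(phi)
                for k in range(K):
                    new_gamma[k] += c * phi[k]
            gamma = new_gamma
        # Accumulate this document's contribution to the shard statistics.
        for (w, c), phi in zip(doc, phis):
            for k in range(K):
                stats[k][w] += c * phi[k]
    return stats


def m_step(shard_stats, smoothing=1e-9):
    """Sum the per-shard statistics and renormalize each topic row (assumed smoothing)."""
    K, V = len(shard_stats[0]), len(shard_stats[0][0])
    beta = []
    for k in range(K):
        row = [smoothing + sum(s[k][w] for s in shard_stats) for w in range(V)]
        norm = sum(row)
        beta.append([x / norm for x in row])
    return beta


def parallel_em(docs, beta, alpha=0.1, workers=2, em_iters=5):
    """Run EM with the E-step mapped over document shards and a single reducing M-step."""
    shards = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(em_iters):
            stats = list(pool.map(e_step_shard, shards,
                                  [beta] * workers, [alpha] * workers))
            beta = m_step(stats)
    return beta
```

A thread pool is used here only to keep the sketch short and portable; in CPython it will not speed up this pure-Python arithmetic. The same map-then-reduce structure would apply if the shards were handed to separate processes or machines, where the cost of shipping `beta` out and the statistics back becomes part of the per-iteration overhead.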
Published in:
Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on
Date of Conference: 28-31 Oct. 2007