Online PLSA: Batch Updating Techniques Including Out-of-Vocabulary Words


Abstract:

A novel method is proposed for updating already trained asymmetric and symmetric probabilistic latent semantic analysis (PLSA) models in the context of a varying document stream. The proposed method is termed online PLSA (oPLSA). The oPLSA employs a fixed-size moving window over the document stream to incorporate new documents and, at the same time, to discard old ones (i.e., documents that fall outside the scope of the window). In addition, the oPLSA assimilates words that have not been seen before (out-of-vocabulary words) and discards words that appear exclusively in the documents being thrown away. To handle the new words, Good-Turing estimates of the probabilities of unseen words are exploited. The experimental results demonstrate that the oPLSA is more accurate than well-known PLSA updating methods, such as the PLSA folding-in (PLSA fold.), the PLSA rerun from the breakpoint, the quasi-Bayes PLSA, and the Incremental PLSA. A comparison of CPU run times reveals that the oPLSA is the second fastest method, after the PLSA fold. However, the higher accuracy of the oPLSA compared with the PLSA fold. compensates for its longer computation time. The oPLSA, the other PLSA updating methods, and online LDA are also tested on document clustering, and F1 scores are reported.
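
The window and vocabulary bookkeeping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' oPLSA update: it only tracks which words enter and leave the vocabulary as a fixed-size window advances, and it computes the standard Good-Turing estimate of the total probability of unseen words, N1 / N (the number of words seen exactly once divided by the total token count). The names `SlidingWindowVocabulary`, `advance`, and `good_turing_unseen_mass` are hypothetical and chosen for illustration; how the oPLSA actually distributes this probability among new words and re-estimates the model parameters is not specified in the abstract and is not shown here.

```python
from collections import Counter, deque


def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen words: N1 / N."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


class SlidingWindowVocabulary:
    """Hypothetical helper: word counts over a fixed-size window of documents."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def advance(self, new_doc):
        """Add one tokenized document; return (oov_words, dropped_words)."""
        # Words in the new document that the current vocabulary has never seen.
        oov = {w for w in new_doc if w not in self.counts}
        self.window.append(new_doc)
        self.counts.update(new_doc)

        dropped = set()
        if len(self.window) > self.window_size:
            old_doc = self.window.popleft()
            self.counts.subtract(old_doc)
            # Words that appeared only in the discarded document leave the vocabulary.
            dropped = {w for w in set(old_doc) if self.counts[w] == 0}
            for w in dropped:
                del self.counts[w]
        return oov, dropped


vocab = SlidingWindowVocabulary(window_size=2)
vocab.advance(["a", "b", "a"])
vocab.advance(["b", "c"])
mass_before = good_turing_unseen_mass(vocab.counts)  # mass reserved for unseen words
oov, dropped = vocab.advance(["c", "d"])
print(mass_before, oov, dropped)
```

In this toy run, the estimate is computed before the third document arrives, so it represents the probability that the current window's statistics reserve for words such as "d" that are about to appear for the first time.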
Published in: IEEE Transactions on Neural Networks and Learning Systems (Volume: 25, Issue: 11, November 2014)
Page(s): 1953 - 1966
Date of Publication: 11 February 2014

PubMed ID: 25330420
