The goal of online event analysis is to detect events and track their associated documents in real time from a continuous stream of documents generated by multiple information sources. Unlike traditional text categorization methods, event analysis approaches consider the temporal relations among documents. However, such approaches suffer from the threshold-dependency problem: they perform well only within a narrow range of thresholds. In addition, if the contents of a document stream change, the optimal threshold (that is, the threshold that yields the best performance) often changes as well. In this paper, we propose a threshold-resilient online algorithm, called the incremental probabilistic latent semantic indexing (IPLSI) algorithm, which alleviates the threshold-dependency problem and simultaneously maintains the continuity of the latent semantics to better capture the story line development of events. The IPLSI algorithm is theoretically sound, and it is empirically efficient and effective for event analysis. An evaluation on the topic detection and tracking (TDT)-4 corpus shows that the algorithm reduces the cost of event analysis by as much as 15 to 20 percent and increases the acceptable threshold range by 200 to 300 percent over the baseline.