Skip to Main Content
Real-time search engines are increasingly indexing web content using data streams, since a number of web sources including news and social media sites are now delivering up-to-date information via streams. Accordingly, it is a crucial challenge for a real-time search engine using data streams to improve index freshness that primarily depends on the latencies involved during fetching and indexing processes. Retrieval latency is a time lag between document publication and fetching while indexing latency is a delay required for a fetched document to be indexed, which is caused by finiteness of indexing capacity. The problem of retrieval latency can be satisfactorily addressed by use of appropriate fetching scheduling or recent real-time content notification protocols. However, as the entire volume of real-time content rapidly grows, the indexing latency becomes a challenging problem. Furthermore, the need for maximizing index coverage makes it more difficult to reduce the indexing latency under the limited indexing capacity. We consider a problem of jointly optimizing the indexing latency as well as indexindexing latency coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments have been conducted to validate the proposed model, and suggest that the proposed approach outperforms the other alternatives.