Most traditional Information Retrieval models assume that query terms are independent of each other and represent a document as a bag of words. Nevertheless, this assumption often does not hold in practice. In this talk, I will discuss how query terms associate with each other and how to incorporate term proximity information into classical probabilistic IR models. I will discuss the relationship between a document's length and its relevance, and how to balance the Verbosity and Scope hypotheses by modeling document length within the probabilistic weighting model. I will also present how to incorporate this relationship into the classical BM25 model. Through extensive experiments on standard large-scale TREC Web collections, I will show that the extended models markedly outperform the BM25 baseline and perform at least comparably to the state-of-the-art model. The talk will conclude with a discussion of novel challenges raised in extending probabilistic Information Retrieval, along with several applications: promoting diversity in ranking for biomedical IR, sentiment analysis for predicting sales performance, and EMR data analysis for effective health care.
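The extended models presented in the talk are not reproduced here, but the BM25 baseline they build on can be sketched. In the classic formulation, the length-normalization parameter `b` is exactly where the Verbosity/Scope trade-off lives: `b = 1` fully normalizes term frequency by document length (the Verbosity hypothesis, under which longer documents merely repeat themselves), while `b = 0` ignores length entirely (the Scope hypothesis, under which longer documents genuinely cover more material). The function below is a minimal sketch of this standard weighting; parameter names and defaults are illustrative.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Classic BM25 weight for one query term in one document.

    tf          -- term frequency in the document
    df          -- number of documents containing the term
    doc_len     -- length of this document (in tokens)
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    k1          -- term-frequency saturation parameter
    b           -- length-normalization parameter:
                   b=1 -> full normalization (Verbosity hypothesis),
                   b=0 -> no normalization (Scope hypothesis).
    """
    # Robertson/Sparck Jones IDF, smoothed to stay non-negative.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Length-normalized term-frequency component.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)
```

With `b > 0`, a document longer than average is penalized for the same raw term frequency; setting `b = 0` removes the penalty, which is one concrete way to see the two hypotheses the talk balances.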