When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k query over deterministic data. Our observation is that the ranks of a tuple across all possible worlds form a well-founded rank distribution, and this distribution is the basis of our ranking definition. We study ranking definitions based on the expectation, the median, and other order statistics of a tuple's rank distribution, and derive the expected rank, median rank, and quantile rank accordingly. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N constant-size tuples, the processing cost for expected rank is O(N log N), no worse than simply sorting the relation. The costs for median and quantile ranks are higher because they require dynamic programming. Nevertheless, they can still be computed in low polynomial time. Furthermore, in most cases, we provide pruning techniques that can terminate the search early while guaranteeing that the top-k has been found.
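To make the idea of a rank distribution over possible worlds concrete, the following is a minimal sketch of expected rank under attribute-level uncertainty. It is an illustration under stated assumptions, not the paper's O(N log N) algorithm: the toy relation, the function names, and the tie-handling rule (a tuple's rank in a world is the number of tuples with a strictly higher score) are all hypothetical choices made here. The first function enumerates every possible world by brute force; the second uses linearity of expectation to sum pairwise probabilities, which takes quadratic time in the number of tuples.

```python
from itertools import product

# Hypothetical attribute-level relation (not from the paper): each tuple has
# an independent discrete score distribution given as (score, probability).
relation = {
    "t1": [(100, 0.4), (70, 0.6)],
    "t2": [(92, 0.6), (80, 0.4)],
    "t3": [(85, 1.0)],
}

def expected_ranks_by_enumeration(rel):
    """Brute force: weight each tuple's rank in every possible world."""
    names = list(rel)
    exp_rank = {name: 0.0 for name in names}
    # A possible world picks one (score, probability) choice per tuple.
    for world in product(*(rel[name] for name in names)):
        prob = 1.0
        for _, p in world:
            prob *= p
        scores = [s for s, _ in world]
        for i, name in enumerate(names):
            # Assumed rank: number of tuples scoring strictly higher.
            rank = sum(1 for s in scores if s > scores[i])
            exp_rank[name] += prob * rank
    return exp_rank

def expected_ranks_pairwise(rel):
    """Linearity of expectation: sum Pr[other > this] over all other tuples."""
    exp_rank = {}
    for name, dist in rel.items():
        total = 0.0
        for other_name, other_dist in rel.items():
            if other_name == name:
                continue
            total += sum(p * q
                         for s, p in dist
                         for v, q in other_dist
                         if v > s)
        exp_rank[name] = total
    return exp_rank

def top_k(rel, k):
    """Return the k tuple names with the smallest expected rank."""
    ranks = expected_ranks_pairwise(rel)
    # Break ties on name so the result is deterministic.
    return sorted(ranks, key=lambda name: (ranks[name], name))[:k]

if __name__ == "__main__":
    print(expected_ranks_by_enumeration(relation))
    print(expected_ranks_pairwise(relation))
    print(top_k(relation, 2))
```

On this toy relation both functions agree up to floating-point rounding (expected ranks of about 1.2, 0.8, and 1.0 for t1, t2, and t3), so the top-2 by expected rank is t2 followed by t3. Note that t1 has the highest possible score yet ranks last in expectation, because it is more likely to take its low score; this is the sense in which the ranking reflects the whole rank distribution rather than a single world.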
IEEE Transactions on Knowledge and Data Engineering (Volume: PP, Issue: 99)