This paper studies the performance of range queries on a historical web page data set. Such a data set records the historical versions of the HTML content of URLs on the web, and a range query requests the subset of records whose URLs and timestamps satisfy the query conditions. Because it keeps every version of every URL, the data set can easily grow to terabytes, so systems that serve queries over it require substantial computing resources. We show that in this scenario the data storage layout has a significant impact on query performance, and we propose storage design principles for performance improvement, derived through quantitative approaches.
Published in:
ChinaGrid Conference (ChinaGrid), 2010 Fifth Annual
Date of Conference: 16-18 July 2010