Meshing Streaming Updates with Persistent Data in an Active Data Warehouse

5 Author(s)
Polyzotis, N. (Comput. Sci. Dept., Univ. of California at Santa Cruz, Santa Cruz, CA); Skiadopoulos, S.; Vassiliadis, P.; Simitsis, A.
more authors

Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
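
The two ideas named in the abstract (sequential scans of R, and sharing each scan across many stream tuples) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the function name mesh_join, its parameters, the representation of R as a list of pages, and the fixed batch size are all assumptions made for illustration, and the paper's cost model and tuning are not reproduced. The sketch assumes an equijoin, reads R one page at a time in cyclic order, and keeps each batch of stream tuples in memory for exactly one full cycle, so every page read is probed by all stream tuples currently held.

```python
from collections import defaultdict, deque
from itertools import islice


def mesh_join(stream, relation_pages, batch_size, key_s, key_r):
    """Yield (s, r) pairs with key_s(s) == key_r(r).

    Hypothetical sketch: relation_pages stands in for the disk-based
    relation R, already split into pages that are read cyclically.
    """
    k = len(relation_pages)      # pages in one full scan of R
    window = deque()             # in-memory stream batches, oldest first
    index = defaultdict(deque)   # join key -> in-memory stream tuples
    pending = 0                  # stream tuples currently in memory
    page_no = 0
    stream = iter(stream)

    while True:
        batch = list(islice(stream, batch_size))
        if not batch and pending == 0:
            break                # stream drained and nothing left in memory

        # Admit the new batch (possibly empty while draining the tail).
        window.append(batch)
        for s in batch:
            index[key_s(s)].append(s)
        pending += len(batch)

        # One sequential page read, shared by every stream tuple in memory.
        for r in relation_pages[page_no]:
            for s in index.get(key_r(r), ()):
                yield (s, r)
        page_no = (page_no + 1) % k

        # The oldest batch has now met all k pages, so it can be evicted.
        if len(window) == k:
            for s in window.popleft():
                index[key_s(s)].popleft()
            pending -= len(window_batch_size := 0) if False else 0
            pending = sum(len(b) for b in window)


# Illustrative data (assumed, not from the paper).
pages = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
updates = [(10, 1), (11, 3), (12, 2), (13, 5), (14, 3)]
matches = list(mesh_join(updates, pages, 2,
                         key_s=lambda s: s[1], key_r=lambda r: r[0]))
```

Because a batch is evicted only after k consecutive page reads, each stream tuple is compared against every page of R exactly once, while each page read serves up to k batches at a time. Memory use in this sketch grows with k times the batch size, which is the kind of trade-off between memory and throughput that the abstract says the cost model is meant to tune.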

Published in:

Knowledge and Data Engineering, IEEE Transactions on (Volume: 20, Issue: 7)