By Topic

Managing statistical behavior of large data sets in shared-nothing architectures

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Rigoutsos, I. ; IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA ; Delis, A.

Increasingly larger data sets are being stored in networked architectures. Many of the available data structures are not easily amenable to parallel realizations. Hashing schemes show promise in that respect for the simple reason that the underlying data structure can be decomposed and spread among the set of cooperating nodes with minimal communication and maintenance requirements. In all cases, storage utilization and load balancing are issues that need to be addressed. One can identify two basic approaches to tackle the problem. One way is to address it as part of the design of the data structure that is used to store and retrieve the data. The other is to maintain the data structure intact but address the problem separately. The method that we present here falls in the latter category and is applicable whenever a hash table is the preferred data structure. Intrinsically attached to the used hash table is a hashing function that allows one to partition a possibly unbounded set of data items into a finite set of groups; the hashing function provides the partitioning by assigning each data item to one of the groups. In general, the hashing function cannot guarantee that the various groups will have the same cardinality on average, for all possible data item distributions. In this paper, we propose a two-stage methodology that uses the knowledge of the hashing function to reorganize the group assignments so that the resulting groups have similar expected cardinalities. The method is generally applicable and independent of the used hashing function. We show the power of the methodology using both synthetic and real-world databases. The derived quasi-uniform storage occupancy and associated load-balancing gains are significant

Published in:

Parallel and Distributed Systems, IEEE Transactions on  (Volume:9 ,  Issue: 11 )