Skip to Main Content
Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on (Volume:34 , Issue: 6 )
Date of Publication: Dec. 2004