A Framework for Clustering Categorical Time-Evolving Data


Abstract:

A fundamental assumption often made in unsupervised learning is that the problem is static, i.e., the description of the classes does not change with time. However, many practical clustering tasks involve changing environments. Methods and techniques for analyzing evolving trends in changing environments are therefore of increasing interest and importance. Although the problem of clustering numerical time-evolving data is well explored, clustering categorical time-evolving data remains a challenging issue. In this paper, we propose a generalized clustering framework for categorical time-evolving data, which is composed of three algorithms: a drifting-concept detection algorithm that detects the difference between the current sliding window and the last sliding window, a data-labeling algorithm that decides the most appropriate cluster label for each object of the current sliding window based on the clustering results of the last sliding window, and a cluster-relationship-analysis algorithm that analyzes the relationship between clustering results at different time stamps. The time-complexity analysis indicates that these proposed algorithms are effective for large datasets. Experiments on a real dataset show that the proposed framework not only accurately detects the drifting concepts but also attains clustering results of better quality. Furthermore, compared with the other framework, the proposed one needs fewer parameters, which is favorable for specific applications.
Published in: IEEE Transactions on Fuzzy Systems (Volume: 18, Issue: 5, October 2010)
Page(s): 872 - 882
Date of Publication: 20 May 2010


I. Introduction

Many real applications, such as network-traffic monitoring, the stock market, credit card fraud detection, and web click streams, generate continuously arriving data, which are known as data streams [1]. A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by time stamps) sequence of items. For data-stream applications, it is impossible to control the order in which items arrive, and the volume of data is usually too large to be stored on permanent devices or to be scanned thoroughly more than once. Moreover, the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features. In other words, the concepts that we try to learn from those data drift with time. For example, the buying preferences of customers may change with time, depending on the current day of the week, availability of alternatives, discounting rate, etc. As the concepts behind the data evolve with time, the underlying clusters may also change considerably with time. Performing clustering on the entire time-evolving data not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results. Methods and techniques for analyzing evolving trends in fast data streams have hence become very important in recent years [2].
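
The idea of comparing the current sliding window with the last one can be illustrated with a minimal Python sketch. It is not the drifting-concept detection algorithm proposed in the paper, whose details are not given in this excerpt. As a stand-in, it assumes that windows are consecutive non-overlapping blocks of objects and that drift is flagged when the per-attribute value frequencies of two adjacent windows differ by more than a threshold. The function names, the total-variation score, and the default threshold of 0.3 are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, ...]  # one categorical object: one value per attribute


def sliding_windows(stream: Sequence[Record], size: int) -> List[List[Record]]:
    """Split the stream into consecutive, non-overlapping windows of `size`
    objects (assumption: a trailing partial window is dropped)."""
    return [list(stream[i:i + size])
            for i in range(0, len(stream) - size + 1, size)]


def value_frequencies(window: Sequence[Record]) -> Dict[Tuple[int, str], float]:
    """Relative frequency of every (attribute index, value) pair in a window."""
    counts = Counter((j, v) for rec in window for j, v in enumerate(rec))
    n = len(window)
    return {key: c / n for key, c in counts.items()}


def drift_score(prev: Sequence[Record], curr: Sequence[Record]) -> float:
    """Total-variation distance between the value distributions of two
    windows, averaged over attributes (0 = identical, 1 = disjoint)."""
    fp, fc = value_frequencies(prev), value_frequencies(curr)
    n_attrs = len(prev[0])
    total = sum(abs(fp.get(k, 0.0) - fc.get(k, 0.0)) for k in set(fp) | set(fc))
    return total / (2 * n_attrs)


def detect_drift(stream: Sequence[Record], size: int,
                 threshold: float = 0.3) -> List[bool]:
    """For each pair of adjacent windows, report whether the score exceeds
    the (hypothetical) threshold."""
    windows = sliding_windows(stream, size)
    return [drift_score(a, b) > threshold
            for a, b in zip(windows, windows[1:])]


if __name__ == "__main__":
    # Toy stream: two windows with the same values, then a complete change.
    stream = [("a", "x")] * 8 + [("b", "y")] * 4
    print(detect_drift(stream, size=4))
```

On the toy stream, the first pair of windows is identical and the second pair shares no values, so only the second comparison is flagged. A full framework in the spirit of the abstract would follow a positive detection with re-clustering and otherwise label the new objects from the previous clustering result; both steps are omitted here.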
