Scalable Online-Offline Stream Clustering in Apache Spark | IEEE Conference Publication | IEEE Xplore


Abstract:

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches to bring a competitive stream clustering algorithm, CluStream, into a modern framework for distributed computing, Apache Spark. CluStream is one of the most popular stream clustering approaches and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. The result is a scalable, open-source stream clustering method that can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
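To make the online-offline process concrete, the following is a minimal single-machine sketch in plain Python, not the paper's Spark implementation. It assumes the statistical summaries are additive micro-clusters (count, linear sum, squared sum), as in the original CluStream, and that the offline phase is a weighted k-means over their centroids. The names `MicroCluster`, `online_phase`, and `offline_phase`, the fixed `radius` threshold, and the naive seeding are illustrative assumptions; the original algorithm also tracks timestamps and keeps a bounded number of micro-clusters, which this sketch omits.

```python
import math


class MicroCluster:
    """Additive summary of nearby points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        # Summaries are additive, so absorbing a point is a constant-size update.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def online_phase(stream, radius):
    """Single pass: absorb each point into the nearest summary within
    `radius` (an assumed fixed threshold), otherwise start a new summary."""
    summaries = []
    for p in stream:
        nearest = min(summaries, key=lambda m: dist(m.centroid(), p), default=None)
        if nearest is not None and dist(nearest.centroid(), p) <= radius:
            nearest.absorb(p)
        else:
            summaries.append(MicroCluster(p))
    return summaries


def offline_phase(summaries, k, iterations=10):
    """Weighted k-means over summary centroids, each weighted by its count."""
    points = [(m.centroid(), m.n) for m in summaries]
    centers = [c for c, _ in points[:k]]  # naive seeding: first k summaries
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for c, w in points:
            j = min(range(len(centers)), key=lambda i: dist(centers[i], c))
            groups[j].append((c, w))
        for j, g in enumerate(groups):
            if g:  # keep the old center if no summary was assigned to it
                total = sum(w for _, w in g)
                dim = len(g[0][0])
                centers[j] = [sum(c[d] * w for c, w in g) / total for d in range(dim)]
    return centers


stream = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3), (9.0, 0.0)]
summaries = online_phase(stream, radius=1.0)
macro_clusters = offline_phase(summaries, k=2)
```

The offline step never revisits the raw stream; it only reads the summaries, which is what allows the two phases to run at different rates.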
Date of Conference: 12-15 December 2016
Date Added to IEEE Xplore: 02 February 2017
ISBN Information:
Electronic ISSN: 2375-9259
Conference Location: Barcelona, Spain
