There is often a need to cluster large volumes of data. Such clustering has applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms, namely K-means, Fuzzy K-means, Dirichlet, and Latent Dirichlet Allocation, within two different cloud runtimes: Hadoop and Granules. Our benchmarks use identical clustering code with both Hadoop and Granules. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We also include an analysis of our results for each of these clustering algorithms in a distributed setting.