As data becomes increasingly unstructured, clustering heterogeneous data is essential for returning structured information in response to user queries. In this paper, we test and validate the results of a new clustering technique, clustering by compression, when applied to metadata associated with heterogeneous sets of documents. The clustering-by-compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files, taken both singly and in pairwise concatenation. Experimental results show that using metadata can improve average clustering performance by about 10% compared with clustering the same sample data set without metadata.
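To make the distance concrete: with C(x) denoting the compressed length of x, the NCD is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The sketch below illustrates that formula only. The choice of zlib as the compressor and the helper names `compressed_len`, `ncd`, and `ncd_matrix` are assumptions made for illustration; the abstract does not say which compressor or clustering step was used, so this should not be read as the authors' implementation.

```python
import zlib
from typing import List, Sequence


def compressed_len(data: bytes) -> int:
    """Length in bytes of the input after zlib compression.

    zlib at level 9 is an assumed stand-in; any real compressor could be used.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items: Sequence[bytes]) -> List[List[float]]:
    """Symmetric matrix of pairwise NCD values, with zeros on the diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

The resulting matrix could be passed to any distance-based clustering method. With a real compressor the values are approximate: NCD(x, x) is small but usually not exactly zero, and values slightly above 1 can occur.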