The quantity of unlabeled documents on the web keeps growing, and organizing related heterogeneous XML documents into clusters according to their structural and conceptual properties has become a pressing need. In this paper, we treat pre-processing as a key step for improving clustering quality, and we propose a new pre-processing method that combines hapax words with path-based descriptors. A constrained agglomerative clustering method is used, and different document representations are compared. The effectiveness of the method is evaluated on the INEX corpus, with clustering quality measured by micro- and macro-average purity.
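
For readers unfamiliar with the two evaluation measures, the following is a minimal sketch of how micro- and macro-average purity are commonly computed. It assumes the usual definitions (the purity of a cluster is the share of its documents that belong to its majority class; the micro average weights clusters by size, the macro average weights them equally), which may differ in detail from the definitions used in the paper. The function name `purity_scores` and the toy data are hypothetical.

```python
from collections import Counter


def purity_scores(cluster_ids, class_labels):
    """Return (micro_purity, macro_purity) for a flat clustering.

    cluster_ids  -- cluster assigned to each document
    class_labels -- ground-truth class of each document
    Assumed definitions, not taken from the paper:
      purity(cluster) = majority-class count / cluster size
      micro = sum of majority-class counts / total number of documents
      macro = unweighted mean of the per-cluster purities
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how many documents of each class fall into each cluster.
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1

    majority = [max(c.values()) for c in counts.values()]
    sizes = [sum(c.values()) for c in counts.values()]

    micro = sum(majority) / sum(sizes)
    macro = sum(m / s for m, s in zip(majority, sizes)) / len(sizes)
    return micro, macro


# Toy example: two clusters over five documents.
micro, macro = purity_scores([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

In the toy example the first cluster has purity 2/3 and the second has purity 1, so the two averages differ whenever cluster sizes are unequal: the micro average is dominated by large clusters, while the macro average treats every cluster alike.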