Skip to Main Content
XML (eXtensible Markup Language) have been adopted by number of software vendors today, it became the standard for data interchange over the web and is platform and application independent also. A XML document is consists of number of attributes like document data, structure and style sheet etc. Clustering is method of creating groups of similar objects. In this paper a weighted similarity measurement approach for detecting the similarity between the homogeneous XML documents is suggested. Using this similarity measurement a new clustering technique is also proposed. The method of calculating similarity of document's structure and styling is given by number of researchers, mostly which are based on tree edit distances. And for calculating the distance between document's contents there are number of text and other similarity techniques like cosine, jaccord, tf-idf etc. In this paper both of the similarity techniques are combined to propose a new distance measurement technique for calculating the distance between a pair of homogeneous XML documents. The proposed clustering model is implemened using open source technology java and is validated experimentally. Given a collection of XML documents distances between documents is calculated and stored in the java collections, and then these distances are used to cluster the XML documents.