The importance of detecting similar documents grows as the amount of available information increases exponentially. This paper presents a new technique for identifying similar documents that combines statistical properties of documents with Persian linguistic features. The technique is best suited to detecting similar documents within specific fields. It is built on lexical chains of important words and on the term co-occurrence property of the text. This prevents irrelevant documents from being identified as similar because of word polysemy. The method also takes word order into account when identifying similar documents. If a document covers more than one subject, each subject can be found, and similar documents can be detected for each topic of the text. Our results show improved performance compared to existing word-based methods such as LSI and VSM.
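
To illustrate the polysemy problem that word-based baselines face, the following is a minimal sketch of a plain VSM comparison using cosine similarity over raw term counts. It is not the proposed method. The function name `cosine_similarity`, the whitespace tokenizer, and the toy English documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Hypothetical helper: tokenization is a naive lowercase whitespace split.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents: "bank" is used in two unrelated senses.
finance = "bank loan interest bank"
river = "bank river water bank"
sports = "football goal match team"

print(cosine_similarity(finance, river))
print(cosine_similarity(finance, sports))
```

In this toy case the finance and river texts receive a high score only because they share the surface form "bank", even though the subjects differ. A purely word-based vector cannot tell the two senses apart, which is the kind of false match that context from co-occurring terms is meant to reduce.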